GitHub Repository Here

Background and Question

Question

What are the most significant risk factors associated with hospital readmission rate in hip/knee replacement patients?

Hypothesis

Hospitals with better Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) scores will have lower readmission rates for hip/knee replacements because higher patient satisfaction often correlates with better overall care quality and patient outcomes, including reduced complications and better post-discharge support (Edwards et al., 2015).

Prediction

It is anticipated that patient satisfaction, clinical treatment quality, and operational efficiency will be the most important risk factor categories associated with hospital readmission rates among patients who have had hip or knee replacement surgery. The most important variables influencing readmission rates for patients after hip or knee replacement surgery will most likely be a combination of high HCAHPS scores and reduced rates of clinical complications.

Methods

Our preprocessing and feature engineering approach was designed to address the complexities highlighted during exploratory data analysis (EDA). The initial plan involved several key steps:

Preprocessing: We focused on handling missing values, encoding categorical variables, and reducing collinearity. The preprocessing steps included dummy encoding of ordinal data, mapping categorical states to numeric codes, and removing variables with high missingness or redundancy. This approach aimed to ensure that the dataset was clean and suitable for further analysis. We specifically addressed the high number of missing values and collinearity issues as identified in our EDA.

Feature Engineering: We engineered features to consolidate related data and reduce dimensionality. For instance, we created a composite variable for overall mortality scores and removed redundant or highly correlated variables. This strategy aimed to simplify the feature set while retaining critical information, thus enhancing model performance and interpretability.

Initial Model Selection: Based on our EDA findings, we planned to explore multiple methods of unsupervised learning and employ a random forest model. Unsupervised methods, such as KNN for segmentation analysis, were chosen to identify clusters and uncover underlying patterns in the data. The random forest model was selected as an initial supervised learning approach due to its robustness in handling diverse data types and managing feature importance.

Preprocessing and Feature Selection

To ensure the dataset’s suitability for analysis, several preprocessing steps were employed. First, categorical variables were identified, and appropriate encoding methods were applied. The EDV column, which contains ordinal data, was dummy encoded based on predefined levels, transforming it into a numeric format that reflects its ordinal nature. This approach preserves the inherent order of the data while making it suitable for statistical modeling. Additionally, categorical states were encoded using a numeric mapping to eliminate potential ordinal relationships and standardize the representation of states. This choice was made to facilitate model compatibility and avoid any unintended ordinal implications in the analysis.

Unnecessary or highly correlated variables were then removed to streamline the dataset. Variables with significant missingness or redundancy, such as certain patient survey ratings and summary statistics, were excluded. The decision to remove these variables was guided by their high correlation with other variables, high missingness, or redundancy with already included metrics. For instance, combining various patient survey metrics into a single comprehensive score helped reduce collinearity and simplify the feature set. See table 1 for a list of all variables that were removed and justification for doing so.

Table 1: Justifications for Removal of Variables
Variable Justification
OP_18c Removed due to high correlation and low relevance.
OP_22 Removed due to high correlation and low relevance.
ED_2_Strata_1 Removed due to high percentage of missingness.
OP_23 Removed due to high percentage of missingness.
VTE_2 Removed due to high percentage of missingness.
STK_02 Removed as stroke data is not relevant to Hip/Knee Surgery.
STK_05 Removed as stroke data is not relevant to Hip/Knee Surgery.
STK_06 Removed as stroke data is not relevant to Hip/Knee Surgery.
HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE Removed due to strong correlation with overall hospital rating.
ExcessReadmissionRatio_HIP-KNEE Removed due to high correlation with target variable.
ExpectedReadmissionRate_HIP-KNEE Removed due to high correlation with target variable.
NumberOfReadmissions_HIP-KNEE Removed as it is influenced by hospital size, which is not available.
Sepsis Variables Removed due to unclear definition in dataset dictionary.
Score_PSI_90 Removed because it is a summary of other PSI variables, making it redundant.
Patient Survey Data Removed due to collinearity and redundancy with other patient survey metrics.

Feature engineering focused on creating meaningful variables from existing data and removing irrelevant or redundant features. A notable example is the creation of the Score_Ovr_MORT variable, which aggregates multiple mortality-related scores into a single measure. This new feature consolidates related variables, reducing dimensionality while preserving critical information on mortality rates.

Initial Model

For our initial model, we chose the Random Forest algorithm due to its robustness and ability to handle a large number of input variables without overfitting. Random Forests are ensemble learning methods that construct multiple decision trees during training and output the mode of the classes (classification) or mean prediction (regression) of the individual trees. This approach helps in managing the complexities and interactions between numerous features in our dataset. We used a grid search with 7-fold cross-validation to tune the mtry parameter, which determines the number of variables considered at each split in the trees.

Results

Preprocessing and Feature Engineering

First, categorical variables were appropriately encoded, and redundant variables were removed, as described in the previous section. Collinearity was greatly reduced, as evident by the heat maps in Figure 1 and Figure 2. Additionally, all rows with NA values for the target variables were removed. We also decided to remove the one facility that had an NA value, which happened to be the same observation with a missing state value. Variables with less than 5% missing data were imputed by the median, as this has a small impact on the variation of the predictors. Any variables with greater than 5% missing values were imputed using kNN to allow for more accurate imputation and mitigate reduction in variation. All steps were repeated on the test dataset, which is the most recent snapshot of the data from April 24, 2024. Finally, descriptive statistics were computed for all variables (Table 2, Figure 3).

Table 2. Descriptive Statistics for All Numeric Variables in Final Dataset
mean sd mad min max range skew kurtosis se
PredictedReadmissionRate_HIP_KNEE 4.546561e+00 0.9093914 0.8576841 1.9279 8.569 6.6411 0.4373061 0.4908058 0.0212407
HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE 8.672450e+01 3.7230744 2.9652000 66.0000 98.000 32.0000 -0.5851721 1.5050011 0.0869602
EDV 2.694490e+00 1.0250451 1.4826000 1.0000 4.000 3.0000 -0.1385316 -1.1578643 0.0239421
HCP_COVID_19 8.887098e+01 9.5606361 8.3025600 32.6000 100.000 67.4000 -1.3148000 2.1219028 0.2233087
IMM_3 7.932570e+01 17.7478481 16.3086000 4.0000 100.000 96.0000 -1.0958217 0.7241943 0.4145381
OP_18b 1.895739e+02 49.6783664 44.4780000 62.0000 587.000 525.0000 0.9817063 3.3080181 1.1603422
OP_29 9.215876e+01 11.7440127 4.4478000 0.0000 100.000 100.0000 -3.3057142 14.4641144 0.2743060
SAFE_USE_OF_OPIOIDS 1.557229e+01 4.2502679 2.9652000 0.0000 45.000 45.0000 0.5890847 3.1474800 0.0992739
VTE_1 9.107365e+01 9.1561257 4.4478000 5.0000 100.000 95.0000 -4.3773341 29.2741519 0.2138605
Score_COMP_HIP_KNEE 3.179924e+00 0.5539759 0.4447800 1.6000 6.200 4.6000 0.7280978 1.7214783 0.0129393
Score_PSI_03 5.878123e-01 0.5183737 0.2816940 0.0500 6.310 6.2600 3.5027779 21.8351982 0.0121077
Score_PSI_04 1.688142e+02 19.0239898 15.6117780 86.6800 241.810 155.1300 -0.0656641 1.2542628 0.4443451
Score_PSI_06 2.467703e-01 0.0452557 0.0296520 0.1200 0.510 0.3900 0.9085914 1.9551905 0.0010570
Score_PSI_08 8.994540e-02 0.0082004 0.0000000 0.0600 0.130 0.0700 0.5502254 1.7580966 0.0001915
Score_PSI_09 2.514146e+00 0.4953103 0.3409980 1.1000 6.100 5.0000 1.2648156 4.8211377 0.0115690
Score_PSI_10 1.572951e+00 0.3731798 0.1482600 0.4700 4.550 4.0800 1.8287437 7.4056616 0.0087164
Score_PSI_11 8.969864e+00 3.1123203 2.2239000 2.7300 49.000 46.2700 2.7479837 22.3393252 0.0726947
Score_PSI_12 3.582755e+00 0.7944605 0.6671700 1.6100 7.510 5.9000 0.9512312 1.6446910 0.0185563
Score_PSI_13 5.287103e+00 1.0312666 0.7413000 2.1700 10.790 8.6200 1.0438233 2.7498825 0.0240874
Score_PSI_14 2.010540e+00 0.3569335 0.1927380 1.0700 4.400 3.3300 1.9129489 6.4857308 0.0083369
Score_PSI_15 1.102668e+00 0.3271882 0.2223900 0.3500 3.430 3.0800 1.7127063 5.4101230 0.0076422
FacilityName* 9.026263e+02 517.8533090 667.1700000 1.0000 1796.000 1795.0000 -0.0093756 -1.2029068 12.0955473
State* 2.469449e+01 14.4448000 20.7564000 1.0000 50.000 49.0000 0.0222112 -1.3657016 0.3373885
Payment_PAYM_90_HIP_KNEE 2.105666e+04 1943.7230953 1716.8508000 15936.0000 34916.000 18980.0000 0.7645293 1.9173982 45.3997188
StateCode* 2.472995e+01 14.5368326 20.7564000 1.0000 50.000 49.0000 0.0160810 -1.3497989 0.3395381
Score_Ovr_MORT 1.305795e+01 1.1615024 1.0378200 8.1600 17.420 9.2600 -0.0667113 0.5511229 0.0271293

Segmentation Analysis

For unsupervised machine learning, we employed both k-means clustering and hierarchical clustering to explore the underlying structure of the dataset. For k-means clustering, we first standardized the numeric features and used the elbow method to determine the optimal number of clusters, which was found to be three. The k-means algorithm was then applied, and cluster characteristics were analyzed by computing the mean of numeric features within each cluster. Visualizations of feature distributions across clusters and the overall cluster structure were created to gain insights into the clustering results (Figure 4).

Hierarchical clustering was also performed to validate the k-means results. We computed a distance matrix and applied Ward’s method to obtain cluster solutions. An analysis of within-cluster sum of squares (WCSS) for different cluster counts was conducted to determine the optimal number of clusters, which again suggested three clusters. The hierarchical clustering results were compared with those from k-means by visualizing clusters in the PCA-reduced feature space and analyzing cluster characteristics (Figure 5). Both clustering methods revealed distinct patterns, but the preliminary findings suggest that the clusters may not be highly meaningful due to potential issues with the data’s inherent structure. Overall, the preliminary clustering analysis indicates that while the k-means and hierarchical methods can segment the data, the results may not be sufficiently informative or actionable at this stage.

Initial Model

Algorithm

We used a Random Forest algorithm for our initial model due to its robustness and ability to handle both regression and classification tasks, while effectively managing multicollinearity and providing insights into feature importances. The model was trained using a grid search over different mtry values and 7-fold cross-validation to identify the best performing parameters. The model’s performance on the test set was evaluated using RMSE and R-squared metrics. The RMSE on the test set was 0.337, indicating that the model’s predictions were fairly accurate. The R-squared value of 0.863 shows that the model explains a significant portion of the variance in the predicted readmission rates.

Assumptions

Random Forests make minimal assumptions about the underlying data distribution, which makes them versatile. However, they assume that the data is independent and identically distributed, and the model benefits from a large number of features and observations. To validate these assumptions, we conducted residuals analysis, ensuring no patterns indicating violations of independence or homoscedasticity. The Durbin-Watson test was used to check for autocorrelation in the residuals, and residual plots were examined for any non-random patterns. The results showed that the residuals are relatively randomly distributed, with a slight right skew (Figure 6). However, the Q-Q plot indicates that there are more extreme values than should typically be expected in a normal distribution (Figure 7).

Overfitting

To address potential overfitting, we employed several strategies:

Validation with Training and Testing Sets: The model was validated using a separate test set, yielding an RMSE of 0.337 and an R-squared of 0.863, indicating good generalization to unseen data.

Cross-Validation: We used 7-fold cross-validation during the grid search to ensure the model’s performance was consistent across different subsets of the data.

Feature Importances

The feature importances, sorted by %IncMSE and IncNodePurity, highlight which variables had the most influence on the model’s predictions. These insights will guide further feature engineering and model refinement. Below are the top features by %IncMSE and IncNodePurity (Table 3):

Table 3. Feature Importances by %IncMSE and IncNodePurity
Feature IncMSE Feature2 IncNodePurity
Score_COMP_HIP_KNEE 9.7357100 Payment_PAYM_90_HIP_KNEE 123.8902502
Payment_PAYM_90_HIP_KNEE 9.6229143 Score_COMP_HIP_KNEE 114.4625628
HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE 7.5556215 HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE 72.2391717
OP_18b 5.4698510 OP_18b 68.5356171
Score_Ovr_MORT 5.4077690 Score_PSI_15 60.6220998
EDV 5.1776922 Score_Ovr_MORT 60.2425197
Score_PSI_13 4.9000336 Score_PSI_10 59.5208962
Score_PSI_14 4.8856870 Score_PSI_03 59.4229012
StateCode009 4.8328505 Score_PSI_14 59.2358357
Score_PSI_15 4.2941699 Score_PSI_11 59.1724094
Score_PSI_10 4.0174622 Score_PSI_13 58.6569716
StateCode022 3.7888848 Score_PSI_12 56.8858338
StateCode017 3.7549687 Score_PSI_09 53.4640474
Score_PSI_03 3.5726312 HCP_COVID_19 52.0890212
Score_PSI_04 2.8913748 Score_PSI_04 51.8655680
StateCode025 2.7505145 SAFE_USE_OF_OPIOIDS 46.2049432
Score_PSI_06 2.6570681 VTE_1 46.0078517
Score_PSI_12 2.4948006 IMM_3 42.2476506
StateCode032 2.3802295 Score_PSI_06 41.7370371
SAFE_USE_OF_OPIOIDS 2.3714908 OP_29 36.6180186
HCP_COVID_19 2.2980513 EDV 29.8450844
VTE_1 2.2874745 Score_PSI_08 17.6795267
StateCode004 2.2496756 StateCode009 12.2425241
StateCode024 2.0447172 StateCode022 8.4435485
Score_PSI_11 2.0069998 StateCode017 6.4580507
StateCode046 2.0052216 StateCode035 6.1570394
StateCode043 1.9268220 StateCode014 5.9144268
StateCode042 1.8687236 StateCode043 5.8039853
StateCode007 1.8314433 StateCode030 5.1846769
StateCode012 1.7779685 StateCode013 4.7749396
OP_29 1.6823746 StateCode005 4.4827144
StateCode005 1.6764652 StateCode025 3.9203910
StateCode045 1.5536467 StateCode032 3.5868612
Score_PSI_09 1.4929911 StateCode023 3.2959928
StateCode041 1.4148578 StateCode038 3.0902474
StateCode031 1.2854191 StateCode049 2.5998141
StateCode019 1.2183283 StateCode010 2.5641148
Score_PSI_08 1.1995049 StateCode040 2.4285788
StateCode006 0.9132182 StateCode033 2.3972385
StateCode034 0.8838516 StateCode027 2.3772938
StateCode035 0.8077020 StateCode016 2.2735052
StateCode016 0.7077575 StateCode036 2.2682670
StateCode026 0.6510542 StateCode047 1.9595272
StateCode050 0.5220562 StateCode046 1.9184269
StateCode023 0.5218117 StateCode024 1.8304484
StateCode014 0.5193292 StateCode004 1.7659648
StateCode028 0.3499817 StateCode020 1.7213020
StateCode003 0.2716664 StateCode021 1.6954613
StateCode040 0.2708332 StateCode006 1.6759295
StateCode015 0.2574058 StateCode044 1.6302575
StateCode030 0.1470473 StateCode007 1.4674657
StateCode044 0.1060654 StateCode041 1.4299112
StateCode010 0.0924085 StateCode018 1.3573963
StateCode008 0.0000000 StateCode042 1.3312563
IMM_3 -0.0227214 StateCode015 1.0803806
StateCode038 -0.0273910 StateCode003 1.0239009
StateCode002 -0.0734093 StateCode002 0.8448698
StateCode027 -0.1735277 StateCode028 0.7556607
StateCode013 -0.3246193 StateCode019 0.7508580
StateCode036 -0.7108616 StateCode037 0.6948348
StateCode039 -0.7249834 StateCode011 0.6886992
StateCode033 -0.8669980 StateCode012 0.6481818
StateCode020 -0.9201799 StateCode026 0.5724423
StateCode018 -1.0773800 StateCode048 0.5437489
StateCode048 -1.1750056 StateCode029 0.5252000
StateCode011 -1.2182816 StateCode050 0.3265311
StateCode047 -1.2860386 StateCode034 0.2911008
StateCode037 -1.4565429 StateCode045 0.2510802
StateCode029 -1.9125539 StateCode031 0.2394702
StateCode049 -2.1395954 StateCode039 0.1262152
StateCode021 -2.1453124 StateCode008 0.0679025

Discussion and Next Steps

Our preprocessing and feature engineering strategy focused on addressing the data complexities identified during exploratory data analysis (EDA). We handled missing values through imputation and removed variables with high collinearity or redundancy to ensure a clean dataset suitable for modeling. Key preprocessing steps included dummy encoding of ordinal data, standardizing categorical states, and consolidating features to reduce dimensionality. This streamlined approach preserved critical information while simplifying the dataset, enhancing both model performance and interpretability.

For the initial model, we selected the Random Forest algorithm due to its strength in managing diverse data types and its ability to handle multicollinearity effectively. We used a grid search with 7-fold cross-validation to fine-tune the model parameters, achieving an RMSE of 0.337 and an R-squared of 0.863 on the test set, indicating strong predictive performance. We validated model assumptions by analyzing residuals for patterns of autocorrelation and homoscedasticity. Feature importances were evaluated to refine our feature set, guiding further model enhancement. The preprocessing, feature selection, and initial modeling efforts collectively aimed to build a robust predictive model with high generalizability.

Next, we will determine a threshold to categorize hospitals as “preferred” vs “non-preferred” based on the target variable, predicted readmission rate. We will then leverage L1 and L2 regularization via an Elastic Net model to determine significant risk factors. After using Elastic Net and Random Forest to find significant risk factors, these risk factors will be utilized in a neural network model and used to classify hospitals as either “preferred” or “non-preferred”. Classification ability of the neural network model will then be compared with the classification ability of the Random Forest model and Elastic Net model. All of these models will require extensive hyperparameter tuning, particularly the neural network model.

Appendix 1: Data Dictionary

Data Dictionary
* indicates measure utilized in final dataset
Measure_ID Description
FacilityId* Unique facility identifier.
FacilityName Name of the facility.
State State where the facility is located.
ExcessReadmissionRatio_HIP-KNEE The ratio of the predicted readmission rate to the expected readmission rate, based on an average hospital with similar patients. Performance is compared against a ratio of one, such that below one is better and above one is worse in terms of readmission rates.
PredictedReadmissionRate_HIP_KNEE* The number of readmissions within 30 days predicted based on the hospital’s performance with its observed case mix. The predicted number of readmissions is estimated using a hospital-specific intercept, and is intended to reflect the annual expected performance of the hospital given its historical case and patient mix and performance.
ExpectedReadmissionRate_HIP-KNEE The expected number of readmissions in each hospital is estimated using its patient mix and an average hospital-specific intercept. It is thus indirectly standardized to other hospitals with similar case and patient mixes.
NumberOfReadmissions_HIP-KNEE Crude number of readmissions in each hospital within 30 days.
H_HSP_RATING_LINEAR_SCORE* Overall hospital rating - linear mean score. Employs all survey response items in each HCAHPS measure and are converted and combined into a 0-100 linear-scaled measure score.
H_RECMND_LINEAR_SCORE Recommend hospital - linear mean score. From question: Would you recommend this hospital to your friends and family?
EDV* Emergency department volume. Number based on the volume of patients submitted by a hospital used for the measure OP-22: Left without Being Seen.
ED_2 Average (median) admit decision time to time of departure from the emergency department for emergency department patients admitted to inpatient status.
IMM_3* Healthcare workers given influenza vaccination.
HCP_COVID_19* COVID-19 vaccination coverage among healthcare providers.
OP_18b* Average (median) time patients spent in the emergency department before leaving from the visit.
OP_18c Average time patients spent in the emergency department before being sent home (Median Time from ED Arrival to ED Departure for Discharged ED Patients – Psychiatric/Mental Health Patients).
OP_22 Percentage of patients who left the emergency department before being seen.
OP_23 Percentage of patients who came to the emergency department with stroke symptoms who received brain scan results within 45 minutes of arrival.
OP_29* Percentage of patients receiving appropriate recommendation for follow-up screening colonoscopy.
SAFE_USE_OF_OPIOIDS* Percentage of patients who were prescribed 2 or more opioids or an opioid and benzodiazepine concurrently at discharge.
SEP_1 Severe sepsis and septic shock.
SEP_SH_3HR Septic shock 3 hour.
SEP_SH_6HR Septic shock 6 hour.
SEV_SEP_3HR Severe sepsis 3 hour.
SEV_SEP_6HR Severe sepsis 6 hour.
STK_02 Percentage of ischemic stroke patients prescribed or continuing to take antithrombotic therapy at hospital discharge.
STK_05 Percentage of ischemic stroke patients administered antithrombotic therapy by the end of hospital day 2.
STK_06 Percentage of ischemic stroke patients who are prescribed or continuing to take statin medication at hospital discharge.
VTE_1* Percentage of patients that received VTE prophylaxis after hospital admission or surgery.
VTE_2 Percentage of patients that received VTE prophylaxis after being admitted to the intensive care unit (ICU).
Score_COMP_HIP_KNEE* Rate of complications for hip/knee replacement patients.
Score_MORT_30_AMI Death rate for heart attack patients.
Score_MORT_30_COPD Death rate for chronic obstructive pulmonary disease (COPD) patients.
Score_MORT_30_HF Death rate for heart failure patients.
Score_MORT_30_PN Death rate for pneumonia patients.
Score_MORT_30_STK Death rate for stroke patients.
Score_Ovr_MORT* Summary measure (row-wise mean) of Score_MORT_30_AMI, Score_MORT_30_COPD, Score_MORT_30_HF, Score_MORT_30_PN, and Score_MORT_30_STK.
Score_PSI_03* Rate of pressure sores.
Score_PSI_04* Deaths among patients with serious treatable complications after surgery.
Score_PSI_06* Collapsed lung due to medical treatment.
Score_PSI_08* Broken hip from a fall after surgery.
Score_PSI_09* Postoperative hemorrhage or hematoma rate.
Score_PSI_10* Kidney and diabetic complications after surgery.
Score_PSI_11* Respiratory failure after surgery.
Score_PSI_12* Serious blood clots after surgery.
Score_PSI_13* Blood stream infection after surgery.
Score_PSI_14* A wound that splits open after surgery on the abdomen or pelvis.
Score_PSI_15* Accidental cuts and tears from medical treatment.
Score_PSI_90 Serious complications (this is a composite or summary measure).
Payment_PAYM_90_HIP_KNEE* Payment for hip/knee replacement - estimates of payments associated with a 90-day episode of care for hip/knee replacement.

Appendix 2: Preprocessing and Feature Engineering Code

Variable Encoding (SE)

Identify Categorical Variables

# Create function to find categorical variables
is_categorical <- function(x) is.factor(x) | is.character(x)

# Apply function to all variables in the dataset
categorical_vars <- sapply(HipKneeClean, is_categorical)

# Print the names of all categorical variables
categorical <- names(HipKneeClean)[categorical_vars]
categorical
## [1] "FacilityId"   "EDV"          "FacilityName" "State"

Dummy encode the EDV column

# Define the encoding mapping (ignore NAs for now)
encoding_map <- c(
  'low' = 1,
  'medium' = 2,
  'high' = 3,
  'very high' = 4
)
# Dummy encoding used due to ordinal nature of this data

# Create a copy of HipKneeClean and name it HipKneeTrain to separate cleaned dataset and the training dataset
HipKneeTrain <- HipKneeClean %>%
  mutate(EDV = recode(EDV, !!!encoding_map))

# Print first 20 rows of EDV column in HipKneeClean and HipKneeTrain to ensure proper encoding
cat("HipKneeClean")
## HipKneeClean
print(head(HipKneeClean$EDV, 20))
##  [1] "high"      "high"      "high"      "low"       "low"       "high"     
##  [7] "low"       "medium"    "low"       "medium"    "low"       "low"      
## [13] "high"      "high"      "very high" "very high" "low"       "high"     
## [19] "low"       "very high"
cat("HipKneeTrain")
## HipKneeTrain
print(head(HipKneeTrain$EDV, 20))
##  [1] 3 3 3 1 1 3 1 2 1 2 1 1 3 3 4 4 1 3 1 4

Encode each state in alphabetical order

# Manually map out each state with their respective code in alphabetical order with a preceding 0 to make data non-ordinal
state_mapping <- c(
  "AL" = "001",
  "AK" = "002",
  "AZ" = "003",
  "AR" = "004",
  "CA" = "005",
  "CO" = "006",
  "CT" = "007",
  "DE" = "008",
  "FL" = "009",
  "GA" = "010",
  "HI" = "011",
  "ID" = "012",
  "IL" = "013",
  "IN" = "014",
  "IA" = "015",
  "KS" = "016",
  "KY" = "017",
  "LA" = "018",
  "ME" = "019",
  "MD" = "020",
  "MA" = "021",
  "MI" = "022",
  "MN" = "023",
  "MS" = "024",
  "MO" = "025",
  "MT" = "026",
  "NE" = "027",
  "NV" = "028",
  "NH" = "029",
  "NJ" = "030",
  "NM" = "031",
  "NY" = "032",
  "NC" = "033",
  "ND" = "034",
  "OH" = "035",
  "OK" = "036",
  "OR" = "037",
  "PA" = "038",
  "RI" = "039",
  "SC" = "040",
  "SD" = "041",
  "TN" = "042",
  "TX" = "043",
  "UT" = "044",
  "VT" = "045",
  "VA" = "046",
  "WA" = "047",
  "WV" = "048",
  "WI" = "049",
  "WY" = "050"
)

# Create new "StateCode" column with the encoded values
HipKneeTrain <- HipKneeTrain %>%
  mutate(StateCode = state_mapping[State])

# Print 100 rows of the "State" and "StateCode" columns to ensure accuracy
print("State and StateCode Columns")
## [1] "State and StateCode Columns"
print(head(HipKneeTrain[c("State", "StateCode")], 100))
## # A tibble: 100 × 2
##    State StateCode
##    <chr> <chr>    
##  1 AL    001      
##  2 AL    001      
##  3 AL    001      
##  4 AL    001      
##  5 AL    001      
##  6 AL    001      
##  7 AL    001      
##  8 AL    001      
##  9 AL    001      
## 10 AL    001      
## # ℹ 90 more rows
# Print all unique values in "StateCode" column to ensure accuracy
print("Unique StateCode Values")
## [1] "Unique StateCode Values"
print(unique(HipKneeTrain$StateCode))
##  [1] "001" "002" "003" "004" "005" "006" "007" "008" NA    "009" "010" "011"
## [13] "012" "013" "014" "015" "016" "017" "018" "019" "020" "021" "022" "023"
## [25] "024" "025" "026" "027" "028" "029" "030" "031" "032" "033" "034" "035"
## [37] "036" "037" "038" "039" "040" "041" "042" "043" "044" "045" "046" "047"
## [49] "048" "049" "050"

Collinearity and Feature Removal (SE)

Remove correlated and unnecessary variables

# Specify columns to remove
columns_to_remove <- c(
  "ED_2_Strata_1",
  "OP_23",
  "VTE_2",
  "OP_18c",
  "OP_22",
  "STK_02",
  "STK_05",
  "STK_06",
  "HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE",
  "NumberOfReadmissions_HIP-KNEE",
  "ExcessReadmissionRatio_HIP-KNEE",
  "ExpectedReadmissionRate_HIP-KNEE",
  "SEP_1",
  "SEV_SEP_6HR",
  "SEV_SEP_3HR",
  "SEP_SH_6HR",
  "SEP_SH_3HR",
  "Score_PSI_90",
  "PatientSurveyStarRating_H_COMP_1_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_2_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_3_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_5_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_6_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_7_STAR_RATING",    
  "PatientSurveyStarRating_H_CLEAN_STAR_RATING",     
  "PatientSurveyStarRating_H_QUIET_STAR_RATING",     
  "PatientSurveyStarRating_H_HSP_RATING_STAR_RATING",
  "PatientSurveyStarRating_H_RECMND_STAR_RATING",    
  "PatientSurveyStarRating_H_STAR_RATING",
  "HcahpsLinearMeanValue_H_COMP_1_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_2_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_3_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_5_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_6_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_7_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_CLEAN_LINEAR_SCORE",     
  "HcahpsLinearMeanValue_H_QUIET_LINEAR_SCORE"
)

# Remove specified columns
HipKneeTrain <- HipKneeTrain %>% select(-all_of(columns_to_remove))
# Print column names to verify
print("Remaining columns:")
## [1] "Remaining columns:"
print(colnames(HipKneeTrain))
##  [1] "FacilityId"                                     
##  [2] "PredictedReadmissionRate_HIP-KNEE"              
##  [3] "HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE"
##  [4] "EDV"                                            
##  [5] "HCP_COVID_19"                                   
##  [6] "IMM_3"                                          
##  [7] "OP_18b"                                         
##  [8] "OP_29"                                          
##  [9] "SAFE_USE_OF_OPIOIDS"                            
## [10] "VTE_1"                                          
## [11] "Score_COMP_HIP_KNEE"                            
## [12] "Score_MORT_30_AMI"                              
## [13] "Score_MORT_30_COPD"                             
## [14] "Score_MORT_30_HF"                               
## [15] "Score_MORT_30_PN"                               
## [16] "Score_MORT_30_STK"                              
## [17] "Score_PSI_03"                                   
## [18] "Score_PSI_04"                                   
## [19] "Score_PSI_06"                                   
## [20] "Score_PSI_08"                                   
## [21] "Score_PSI_09"                                   
## [22] "Score_PSI_10"                                   
## [23] "Score_PSI_11"                                   
## [24] "Score_PSI_12"                                   
## [25] "Score_PSI_13"                                   
## [26] "Score_PSI_14"                                   
## [27] "Score_PSI_15"                                   
## [28] "FacilityName"                                   
## [29] "State"                                          
## [30] "Payment_PAYM_90_HIP_KNEE"                       
## [31] "StateCode"

“OP_18c” and “OP_22” removed due to being highly correlated and low relevance. “ED_2_Strata_1”, “OP_23”, and “VTE_2” removed due to high percentage of missingness. “STK_02”, “STK_05”, and “STK_06” variables removed as we do not see stroke data as being relevant towards Hip/Knee Surgery. “HcahpsLinearMeanValue_H_RECMND_LINEAR_ SCORE” removed as it is strongly correlated with overall hospital rating. “ExcessReadmissionRatio_HIP-KNEE”, and “ExpectedReadmissionRate_HIP-KNEE” removed due to being highly correlated with the target variable. “NumberOfReadmissions_HIP-KNEE” removed as this would be highly influenced by hospital size and we have no data on hospital sizes. Sepsis variables removed due to unclear definition in the dataset’s dictionary of what the values represent. Score_PSI_90 variable removed because it’s a summary of the other PSI variables. We chose to include all the individual PSI variables, which makes the summary variable redundant. We chose to remove a lot of the patient survey data due to collinearity and redundancy. The average star rating data is redundant with the linear mean score data. We decided to keep the overall hospital rating linear mean score, and the hospital recommendation linear mean score columns. We felt that these variables summarized the other, more granular, metrics. For example COMP-1 is nurse responsiveness and COMP-2 is doctor responsiveness, COMP-3 is staff responsiveness. It makes sense that a lot of these were collinear. We had considered engineering the comp features (1-7) together into a single patient experience variable, however, this was collinear with overall hospital rating and recommendation. We also chose to go with the linear mean score overall hospital rating and recommendation score, because the dataset essentially already scaled these variables for us by performing a linear transformation. It always seems a little scary removing entire chunks of variables, as we wouldn’t want to miss any significant relationships between the variables. Do you think this is a wise decision? Are there any other ideas you could think of to engineer variables in a way to keep more of them?

Reassess collinearity with heatmap and correlation matrix

# Compute correlation matrix
cor_matrix <- cor(HipKneeTrain %>% select_if(is.numeric), use = "pairwise.complete.obs")

# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)

# Plot heatmap
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Figure 5. Correlation Heatmap of Numeric Variables")

# Convert correlation matrix to df
cor_table <- as.data.frame(cor_matrix)

# Add variable names as a column
cor_table$Variable <- rownames(cor_table)

# Reorder columns
cor_table <- cor_table %>%
  select(Variable, everything())

# Print table
cor_table %>%
  kable(caption = "Table 8. Correlation Coefficients Table") %>%
  kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
Table 8. Correlation Coefficients Table
Variable PredictedReadmissionRate_HIP-KNEE HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE EDV HCP_COVID_19 IMM_3 OP_18b OP_29 SAFE_USE_OF_OPIOIDS VTE_1 Score_COMP_HIP_KNEE Score_MORT_30_AMI Score_MORT_30_COPD Score_MORT_30_HF Score_MORT_30_PN Score_MORT_30_STK Score_PSI_03 Score_PSI_04 Score_PSI_06 Score_PSI_08 Score_PSI_09 Score_PSI_10 Score_PSI_11 Score_PSI_12 Score_PSI_13 Score_PSI_14 Score_PSI_15 Payment_PAYM_90_HIP_KNEE
PredictedReadmissionRate_HIP-KNEE PredictedReadmissionRate_HIP-KNEE 1.0000000 -0.2060912 0.1986939 -0.0563082 -0.0028840 0.1295727 -0.0106510 0.1063002 0.0654668 0.3208550 0.0074065 -0.0794948 -0.1067828 -0.0985660 -0.0376746 -0.0037334 -0.0449077 0.0154891 -0.0214412 -0.0182303 0.0710046 0.1130121 0.1047402 0.1193336 0.0140012 -0.0158282 0.2975679
HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE -0.2060912 1.0000000 -0.2262341 0.0302154 0.2182505 -0.2448842 0.0865028 0.1100002 -0.0248375 -0.1091775 -0.0952111 -0.0230632 0.0300940 -0.0702915 -0.0760295 -0.0499063 -0.0948401 -0.0054742 -0.0590998 0.0707774 -0.0157615 -0.1615833 -0.0674588 -0.1458284 -0.0172231 0.0304634 -0.2108956
EDV EDV 0.1986939 -0.2262341 1.0000000 0.1599806 0.0038674 0.5918897 0.0603877 -0.1223889 0.2992859 -0.0240093 -0.0687401 -0.0840621 -0.2588739 -0.1904351 -0.0754281 0.0292819 -0.0438108 -0.0308989 -0.1637017 -0.0181523 0.0837980 0.0455190 0.0749218 0.0724936 0.0399935 0.0144400 0.0053946
HCP_COVID_19 HCP_COVID_19 -0.0563082 0.0302154 0.1599806 1.0000000 0.3203622 0.2574291 0.1067941 -0.0812735 0.0241622 -0.0510683 -0.0869890 -0.1128278 -0.1245435 -0.1523779 -0.0988833 0.0943953 0.0417007 0.0225916 -0.0272232 0.0549990 0.0009222 -0.0909811 0.1091949 -0.0160408 0.0169512 0.0430812 -0.0627505
IMM_3 IMM_3 -0.0028840 0.2182505 0.0038674 0.3203622 1.0000000 0.1105628 0.1317922 0.0410289 0.0906329 -0.0212916 -0.0165321 -0.0616051 -0.0010634 -0.0761104 0.0146397 0.0451508 0.0601831 0.0544576 -0.0311579 0.0899361 0.0412713 -0.0625676 0.0594933 -0.0250714 0.0723872 0.0625753 -0.0720431
OP_18b OP_18b 0.1295727 -0.2448842 0.5918897 0.2574291 0.1105628 1.0000000 0.0506067 -0.1400845 0.2344307 -0.0293698 -0.0678837 -0.1527290 -0.2195933 -0.1858291 -0.0905644 0.0583644 0.0638412 0.0187794 -0.0806516 0.0224833 0.0544528 -0.0076797 0.1653812 0.0993570 0.0676972 0.0374530 -0.0241965
OP_29 OP_29 -0.0106510 0.0865028 0.0603877 0.1067941 0.1317922 0.0506067 1.0000000 -0.0650231 0.1567526 -0.0096464 -0.0600569 0.0081252 0.0099705 -0.0536184 -0.0251654 -0.0032584 0.0312780 0.0059688 -0.0199006 0.0150699 0.0331569 -0.0837910 -0.0108546 -0.0087678 0.0163779 0.0419068 -0.0815209
SAFE_USE_OF_OPIOIDS SAFE_USE_OF_OPIOIDS 0.1063002 0.1100002 -0.1223889 -0.0812735 0.0410289 -0.1400845 -0.0650231 1.0000000 -0.0563373 -0.0081923 -0.0643353 -0.0362573 0.0171605 -0.0204107 -0.0804707 -0.0287158 -0.0995591 -0.0603629 -0.0017558 -0.0043396 -0.0382369 0.0137283 -0.0380145 -0.0258097 -0.0166146 -0.0268805 -0.0048449
VTE_1 VTE_1 0.0654668 -0.0248375 0.2992859 0.0241622 0.0906329 0.2344307 0.1567526 -0.0563373 1.0000000 -0.0526925 -0.0493931 -0.0282911 -0.1051171 -0.1710235 -0.1238916 -0.0378500 -0.1180160 -0.0262166 -0.0522876 -0.0577213 -0.0037803 -0.0301775 -0.0317534 -0.0440578 0.0086141 0.0326658 -0.1256363
Score_COMP_HIP_KNEE Score_COMP_HIP_KNEE 0.3208550 -0.1091775 -0.0240093 -0.0510683 -0.0212916 -0.0293698 -0.0096464 -0.0081923 -0.0526925 1.0000000 0.0830479 -0.0203930 -0.0007242 0.0241066 0.0211621 0.0498557 0.0038509 0.0505415 0.0577776 0.0540124 0.0813038 0.1279724 0.1458258 0.1334619 0.0498603 0.0433809 0.3410864
Score_MORT_30_AMI Score_MORT_30_AMI 0.0074065 -0.0952111 -0.0687401 -0.0869890 -0.0165321 -0.0678837 -0.0600569 -0.0643353 -0.0493931 0.0830479 1.0000000 0.2498600 0.3407616 0.3309425 0.2222539 0.0415523 0.2105379 0.0885083 0.1010348 0.0889343 0.1066619 0.1037006 0.0492328 0.0467554 0.0454462 0.0297688 0.0591548
Score_MORT_30_COPD Score_MORT_30_COPD -0.0794948 -0.0230632 -0.0840621 -0.1128278 -0.0616051 -0.1527290 0.0081252 -0.0362573 -0.0282911 -0.0203930 0.2498600 1.0000000 0.3844105 0.3710744 0.2038243 -0.0069743 0.1713379 0.0478268 0.0397571 0.0429090 0.0320669 0.0426574 -0.0532586 0.0026944 0.0734846 0.0340007 -0.0406696
Score_MORT_30_HF Score_MORT_30_HF -0.1067828 0.0300940 -0.2588739 -0.1245435 -0.0010634 -0.2195933 0.0099705 0.0171605 -0.1051171 -0.0007242 0.3407616 0.3844105 1.0000000 0.4479367 0.3147371 0.0371596 0.2556384 0.0679149 0.1051698 0.0707269 0.0383771 0.0362529 -0.0300702 -0.0086832 0.0647245 0.0342374 -0.0350247
Score_MORT_30_PN Score_MORT_30_PN -0.0985660 -0.0702915 -0.1904351 -0.1523779 -0.0761104 -0.1858291 -0.0536184 -0.0204107 -0.1710235 0.0241066 0.3309425 0.3710744 0.4479367 1.0000000 0.3042563 0.0303815 0.2301195 0.0543554 0.0884315 0.0217880 0.0237048 0.0704445 0.0089560 0.0393676 0.0464407 0.0029691 -0.0062985
Score_MORT_30_STK Score_MORT_30_STK -0.0376746 -0.0760295 -0.0754281 -0.0988833 0.0146397 -0.0905644 -0.0251654 -0.0804707 -0.1238916 0.0211621 0.2222539 0.2038243 0.3147371 0.3042563 1.0000000 0.0687216 0.2380935 0.0878847 0.1014879 0.0674377 0.0622532 0.0725381 0.0474896 0.0513975 0.0492194 0.0625191 -0.0272101
Score_PSI_03 Score_PSI_03 -0.0037334 -0.0499063 0.0292819 0.0943953 0.0451508 0.0583644 -0.0032584 -0.0287158 -0.0378500 0.0498557 0.0415523 -0.0069743 0.0371596 0.0303815 0.0687216 1.0000000 0.1353085 0.0601750 0.0636661 0.1407342 0.0386211 0.0114365 0.1186788 0.0298580 0.0596798 0.0999683 0.0086745
Score_PSI_04 Score_PSI_04 -0.0449077 -0.0948401 -0.0438108 0.0417007 0.0601831 0.0638412 0.0312780 -0.0995591 -0.1180160 0.0038509 0.2105379 0.1713379 0.2556384 0.2301195 0.2380935 0.1353085 1.0000000 0.0601419 0.0870693 0.1059485 0.0523892 0.0649032 0.0782559 0.0123489 0.0652098 0.1018205 -0.0766302
Score_PSI_06 Score_PSI_06 0.0154891 -0.0054742 -0.0308989 0.0225916 0.0544576 0.0187794 0.0059688 -0.0603629 -0.0262166 0.0505415 0.0885083 0.0478268 0.0679149 0.0543554 0.0878847 0.0601750 0.0601419 1.0000000 0.0724291 0.1014588 0.0516246 0.0351464 0.1431056 0.0509831 0.0527115 0.0910520 0.0456525
Score_PSI_08 Score_PSI_08 -0.0214412 -0.0590998 -0.1637017 -0.0272232 -0.0311579 -0.0806516 -0.0199006 -0.0017558 -0.0522876 0.0577776 0.1010348 0.0397571 0.1051698 0.0884315 0.1014879 0.0636661 0.0870693 0.0724291 1.0000000 0.0052449 -0.0360093 0.0198090 0.0394605 0.0093444 0.0228045 0.0127268 -0.0041983
Score_PSI_09 Score_PSI_09 -0.0182303 0.0707774 -0.0181523 0.0549990 0.0899361 0.0224833 0.0150699 -0.0043396 -0.0577213 0.0540124 0.0889343 0.0429090 0.0707269 0.0217880 0.0674377 0.1407342 0.1059485 0.1014588 0.0052449 1.0000000 0.0885278 0.0680540 0.1732337 0.0519119 0.1207438 0.2197254 -0.0237660
Score_PSI_10 Score_PSI_10 0.0710046 -0.0157615 0.0837980 0.0009222 0.0412713 0.0544528 0.0331569 -0.0382369 -0.0037803 0.0813038 0.1066619 0.0320669 0.0383771 0.0237048 0.0622532 0.0386211 0.0523892 0.0516246 -0.0360093 0.0885278 1.0000000 0.1626632 0.1079488 0.2303938 0.0453739 0.0830134 0.0497447
Score_PSI_11 Score_PSI_11 0.1130121 -0.1615833 0.0455190 -0.0909811 -0.0625676 -0.0076797 -0.0837910 0.0137283 -0.0301775 0.1279724 0.1037006 0.0426574 0.0362529 0.0704445 0.0725381 0.0114365 0.0649032 0.0351464 0.0198090 0.0680540 0.1626632 1.0000000 0.1172504 0.2506376 -0.0093577 0.0464067 0.1441986
Score_PSI_12 Score_PSI_12 0.1047402 -0.0674588 0.0749218 0.1091949 0.0594933 0.1653812 -0.0108546 -0.0380145 -0.0317534 0.1458258 0.0492328 -0.0532586 -0.0300702 0.0089560 0.0474896 0.1186788 0.0782559 0.1431056 0.0394605 0.1732337 0.1079488 0.1172504 1.0000000 0.1742084 0.0522204 0.1358951 0.0655557
Score_PSI_13 Score_PSI_13 0.1193336 -0.1458284 0.0724936 -0.0160408 -0.0250714 0.0993570 -0.0087678 -0.0258097 -0.0440578 0.1334619 0.0467554 0.0026944 -0.0086832 0.0393676 0.0513975 0.0298580 0.0123489 0.0509831 0.0093444 0.0519119 0.2303938 0.2506376 0.1742084 1.0000000 0.0056987 0.0878105 0.0949467
Score_PSI_14 Score_PSI_14 0.0140012 -0.0172231 0.0399935 0.0169512 0.0723872 0.0676972 0.0163779 -0.0166146 0.0086141 0.0498603 0.0454462 0.0734846 0.0647245 0.0464407 0.0492194 0.0596798 0.0652098 0.0527115 0.0228045 0.1207438 0.0453739 -0.0093577 0.0522204 0.0056987 1.0000000 0.1176726 -0.0181150
Score_PSI_15 Score_PSI_15 -0.0158282 0.0304634 0.0144400 0.0430812 0.0625753 0.0374530 0.0419068 -0.0268805 0.0326658 0.0433809 0.0297688 0.0340007 0.0342374 0.0029691 0.0625191 0.0999683 0.1018205 0.0910520 0.0127268 0.2197254 0.0830134 0.0464067 0.1358951 0.0878105 0.1176726 1.0000000 -0.0467071
Payment_PAYM_90_HIP_KNEE Payment_PAYM_90_HIP_KNEE 0.2975679 -0.2108956 0.0053946 -0.0627505 -0.0720431 -0.0241965 -0.0815209 -0.0048449 -0.1256363 0.3410864 0.0591548 -0.0406696 -0.0350247 -0.0062985 -0.0272101 0.0086745 -0.0766302 0.0456525 -0.0041983 -0.0237660 0.0497447 0.1441986 0.0655557 0.0949467 -0.0181150 -0.0467071 1.0000000

Imputation and Handling of Missing Values (AC)

# Change - to _ in HIP-KNEE
colnames(HipKneeTrain) <- gsub("-", "_", colnames(HipKneeTrain))

# Remove all NA values in target variable "PredictedReadmissionRate_HIP_KNEE"
HipKneeTrain <- HipKneeTrain %>% filter(!is.na(PredictedReadmissionRate_HIP_KNEE))

# Remove all NA values in the "State", "StateCode", and "FacilityName" columns
HipKneeTrain <- HipKneeTrain %>% drop_na(State, StateCode, FacilityName)


# Print number of remaining variables and observations
dimensions <- dim(HipKneeTrain)
cat("Number of variables:", dimensions[2], "\n")
## Number of variables: 31
cat("Number of observations:", dimensions[1], "\n")
## Number of observations: 1833

We decided to remove the one facility that had an NA value, which also happened to be the same observation with a missing state value.

# Calculate missing values
missing_values_summary <- HipKneeTrain %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  mutate(Missing_Percentage = (Missing_Count / nrow(HipKneeTrain)) * 100)

# Print table
missing_values_summary %>%
  kable(caption = "Table 7. Missing Values Summary") %>%
  kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
Table 7. Missing Values Summary
Variable Missing_Count Missing_Percentage
FacilityId 0 0.0000000
PredictedReadmissionRate_HIP_KNEE 0 0.0000000
HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE 33 1.8003273
EDV 90 4.9099836
HCP_COVID_19 16 0.8728860
IMM_3 16 0.8728860
OP_18b 75 4.0916530
OP_29 222 12.1112930
SAFE_USE_OF_OPIOIDS 69 3.7643208
VTE_1 994 54.2280415
Score_COMP_HIP_KNEE 40 2.1822149
Score_MORT_30_AMI 405 22.0949264
Score_MORT_30_COPD 247 13.4751773
Score_MORT_30_HF 141 7.6923077
Score_MORT_30_PN 125 6.8194217
Score_MORT_30_STK 284 15.4937261
Score_PSI_03 8 0.4364430
Score_PSI_04 575 31.3693399
Score_PSI_06 2 0.1091107
Score_PSI_08 2 0.1091107
Score_PSI_09 2 0.1091107
Score_PSI_10 41 2.2367703
Score_PSI_11 40 2.1822149
Score_PSI_12 2 0.1091107
Score_PSI_13 42 2.2913257
Score_PSI_14 87 4.7463175
Score_PSI_15 29 1.5821058
FacilityName 0 0.0000000
State 0 0.0000000
Payment_PAYM_90_HIP_KNEE 42 2.2913257
StateCode 0 0.0000000

Impute variables with low percentage missingness (<5%) by the median for numeric variables and mode for categorical variables

# Calculate median for columns with <5% missing values
numeric_vars_low_missing <- c("HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE", "EDV", "HCP_COVID_19", "IMM_3", "OP_18b", "SAFE_USE_OF_OPIOIDS", "Score_COMP_HIP_KNEE", "Score_PSI_03", "Score_PSI_06", "Score_PSI_08", "Score_PSI_09", "Score_PSI_10", "Score_PSI_11", "Score_PSI_12", "Score_PSI_13", "Score_PSI_14", "Score_PSI_15", "Payment_PAYM_90_HIP_KNEE")

for (var in numeric_vars_low_missing) {
  HipKneeTrain[[var]][is.na(HipKneeTrain[[var]])] <- median(HipKneeTrain[[var]], na.rm = TRUE)
}

Impute high percentage missingness variables (>5%) using KNN

# Select high missingness variables for KNN imputation
vars_for_knn <- c("VTE_1", "Score_MORT_30_AMI", "Score_MORT_30_COPD", "Score_MORT_30_HF", "Score_MORT_30_PN", "Score_MORT_30_STK", "Score_PSI_04", "OP_29")

# Perform KNN imputation
HipKneeTrain_knn <- kNN(HipKneeTrain, variable = vars_for_knn, k = 5)

# Remove columns created by the KNN function
HipKneeTrain_knn <- HipKneeTrain_knn %>% select(-ends_with("_imp"))

# Update HipKneeTrain with imputed values
HipKneeTrain[vars_for_knn] <- HipKneeTrain_knn[vars_for_knn]
# Calculate missing values
missing_values_summary <- HipKneeTrain %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  mutate(Missing_Percentage = (Missing_Count / nrow(HipKneeTrain)) * 100)

# Print table
missing_values_summary %>%
  kable(caption = "Table 7. Missing Values Summary") %>%
  kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
Table 7. Missing Values Summary
Variable Missing_Count Missing_Percentage
FacilityId 0 0
PredictedReadmissionRate_HIP_KNEE 0 0
HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE 0 0
EDV 0 0
HCP_COVID_19 0 0
IMM_3 0 0
OP_18b 0 0
OP_29 0 0
SAFE_USE_OF_OPIOIDS 0 0
VTE_1 0 0
Score_COMP_HIP_KNEE 0 0
Score_MORT_30_AMI 0 0
Score_MORT_30_COPD 0 0
Score_MORT_30_HF 0 0
Score_MORT_30_PN 0 0
Score_MORT_30_STK 0 0
Score_PSI_03 0 0
Score_PSI_04 0 0
Score_PSI_06 0 0
Score_PSI_08 0 0
Score_PSI_09 0 0
Score_PSI_10 0 0
Score_PSI_11 0 0
Score_PSI_12 0 0
Score_PSI_13 0 0
Score_PSI_14 0 0
Score_PSI_15 0 0
FacilityName 0 0
State 0 0
Payment_PAYM_90_HIP_KNEE 0 0
StateCode 0 0

Feature Engineer Mortality Data (AC)

# Average death rates amongst mortality variables and create new column "Score_Ovr_MORT"
HipKneeTrain$Score_Ovr_MORT <- rowMeans(HipKneeTrain[, c("Score_MORT_30_AMI", 
                                                         "Score_MORT_30_COPD", 
                                                         "Score_MORT_30_HF", 
                                                         "Score_MORT_30_PN", 
                                                         "Score_MORT_30_STK")], 
                                                          na.rm = TRUE)

# Remove old mortality columns
HipKneeTrain <- HipKneeTrain[, !(names(HipKneeTrain) %in% c("Score_MORT_30_AMI", 
                                                            "Score_MORT_30_COPD",
                                                            "Score_MORT_30_HF", 
                                                            "Score_MORT_30_PN", 
                                                            "Score_MORT_30_STK"))]

Reassess heatmap with engineered mortality data

# Compute correlation matrix
cor_matrix <- cor(HipKneeTrain %>% select_if(is.numeric), use = "pairwise.complete.obs")

# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)

# Plot heatmap
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Figure 5. Correlation Heatmap of Numeric Variables")

Save the data for future ease of use

save(HipKneeTrain, file = "HipKneeTrain.RData")

Preprocessing the test dataset (AC)

We are utilizing the most recent snapshot from 04/24/2024 as our test set. Utilizing this brand new data will help to ensure that our model is generalizable and useful for future analyses.

Loading the Data

# Set the directory for the data files
filepath <- "/Users/adelinecasali/Desktop/hospitals_04_2024/" 

# List the files in the directory that have "Hospital.csv"
files <- list.files(path = filepath, pattern = "Hospital.csv")

# Iterate through each file in the list
for(f in 1:length(files)) {
  
# Read the CSV, clean column names to upper camel case, and store in "dat"
    dat <- clean_names(read_csv(paste0(filepath, files[f]),
                                show_col_types = FALSE), 
                       case = "upper_camel")
    
# Remove ".Hospital.csv" part of the file names to create variable name
    filename <- gsub(".Hospital\\.csv", "", files[f])
    
# Assign data to a variable with the above created name
    assign(filename, dat)
}
# Create a df of file names without ".Hospital.csv"
files <- gsub(".Hospital\\.csv", "", files) %>% data.frame()

# Set column name of the df to "File Name"
names(files) <- "File Name"

files %>% 
  kable(
    format = "html",
    caption = "Table 1. List of hospital-level data files.") %>%
    kable_styling(bootstrap_options = c("striped", full_width = F)
  )
Table 1. List of hospital-level data files.
File Name
Complications_and_Deaths
FY_2024_HAC_Reduction_Program
FY_2024_Hospital_Readmissions_Reduction_Program
HCAHPS
Healthcare_Associated_Infections
Maternal_Health
Medicare_Hospital_Spending_Per_Patient
Outpatient_Imaging_Efficiency
Payment_and_Value_of_Care
Timely_and_Effective_Care
Unplanned_Hospital_Visits

Exploring and Preprocessing the FY_2024_Hospital_Readmissions_Reduction_Program dataset (AC)

Viewing and checking for missing values

# Display first 10 rows of FY_2024_Hospital_Readmissions_Reduction_Program 
head(FY_2024_Hospital_Readmissions_Reduction_Program,10)
## # A tibble: 10 × 12
##    FacilityName         FacilityId State MeasureName NumberOfDischarges Footnote
##    <chr>                <chr>      <chr> <chr>       <chr>                 <dbl>
##  1 SOUTHEAST HEALTH ME… 010001     AL    READM-30-H… N/A                      NA
##  2 SOUTHEAST HEALTH ME… 010001     AL    READM-30-H… 616                      NA
##  3 SOUTHEAST HEALTH ME… 010001     AL    READM-30-A… 274                      NA
##  4 SOUTHEAST HEALTH ME… 010001     AL    READM-30-P… 404                      NA
##  5 SOUTHEAST HEALTH ME… 010001     AL    READM-30-C… 126                      NA
##  6 SOUTHEAST HEALTH ME… 010001     AL    READM-30-C… 117                      NA
##  7 MARSHALL MEDICAL CE… 010005     AL    READM-30-A… N/A                       1
##  8 MARSHALL MEDICAL CE… 010005     AL    READM-30-C… 137                      NA
##  9 MARSHALL MEDICAL CE… 010005     AL    READM-30-P… 285                      NA
## 10 MARSHALL MEDICAL CE… 010005     AL    READM-30-H… 129                      NA
## # ℹ 6 more variables: ExcessReadmissionRatio <chr>,
## #   PredictedReadmissionRate <chr>, ExpectedReadmissionRate <chr>,
## #   NumberOfReadmissions <chr>, StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- FY_2024_Hospital_Readmissions_Reduction_Program %>%
  select_if(is.numeric)

# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
## Footnote 
##    12077

Replacing values with NA and “Too Few to Report” values with “5”

# Use the function "replace_with_na_all()" to replace aberrant values with NA
FY_2024_Hospital_Readmissions_Reduction_Program <- replace_with_na_all(FY_2024_Hospital_Readmissions_Reduction_Program, condition = ~ .x == "N/A")

# Replace "Too Few to Report" values with "5" in using gsub
FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions <- gsub("Too Few to Report", "5", FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions)

# Check first 10 rows to confirm that it worked
head(FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions, 10)
##  [1] "5"   "149" "32"  "68"  "11"  "20"  NA    "14"  "40"  "24"
# NumberOfReadmissions had to be converted to numeric before applying integers
FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions <- as.numeric(FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions)

# Find all values of "5" in NumberOfReadmissions
fives <- which(FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions == 5)

# Replace values of "5" with random integers from 1 - 10
FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions[fives] <- sample(1:10, length(fives), replace = TRUE)

# Check the first 20 rows to see if this was applied correctly
head(FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions,20)
##  [1]   8 149  32  68  11  20  NA  14  40  24   1  NA   1  21  15  83  36  75  10
## [20]  NA

Converting columns to numeric

# Selecting the columns to convert
columns_to_convert <- c("NumberOfDischarges", "ExcessReadmissionRatio", "PredictedReadmissionRate", "ExpectedReadmissionRate", "NumberOfReadmissions")

# Use mutate_at to convert the specified columns to numeric
FY_2024_Hospital_Readmissions_Reduction_Program <- FY_2024_Hospital_Readmissions_Reduction_Program %>%
  mutate_at(vars(one_of(columns_to_convert)), as.numeric)

# Print the structure of the dataframe to check the changes
str(FY_2024_Hospital_Readmissions_Reduction_Program)
## tibble [18,774 × 12] (S3: tbl_df/tbl/data.frame)
##  $ FacilityName            : chr [1:18774] "SOUTHEAST HEALTH MEDICAL CENTER" "SOUTHEAST HEALTH MEDICAL CENTER" "SOUTHEAST HEALTH MEDICAL CENTER" "SOUTHEAST HEALTH MEDICAL CENTER" ...
##  $ FacilityId              : chr [1:18774] "010001" "010001" "010001" "010001" ...
##  $ State                   : chr [1:18774] "AL" "AL" "AL" "AL" ...
##  $ MeasureName             : chr [1:18774] "READM-30-HIP-KNEE-HRRP" "READM-30-HF-HRRP" "READM-30-AMI-HRRP" "READM-30-PN-HRRP" ...
##  $ NumberOfDischarges      : num [1:18774] NA 616 274 404 126 117 NA 137 285 129 ...
##  $ Footnote                : num [1:18774] NA NA NA NA NA NA 1 NA NA NA ...
##  $ ExcessReadmissionRatio  : num [1:18774] 0.892 1.1 0.933 0.987 0.952 ...
##  $ PredictedReadmissionRate: num [1:18774] 3.53 23.13 12.9 17.05 9.81 ...
##  $ ExpectedReadmissionRate : num [1:18774] 3.96 21.02 13.83 17.28 10.31 ...
##  $ NumberOfReadmissions    : num [1:18774] 8 149 32 68 11 20 NA 14 40 24 ...
##  $ StartDate               : chr [1:18774] "07/01/2019" "07/01/2019" "07/01/2019" "07/01/2019" ...
##  $ EndDate                 : chr [1:18774] "06/30/2022" "06/30/2022" "06/30/2022" "06/30/2022" ...

Removing excess text from measure names

FY_2024_Hospital_Readmissions_Reduction_Program <-  FY_2024_Hospital_Readmissions_Reduction_Program %>%
  mutate(MeasureName = gsub("READM-30-", "", MeasureName)) %>% 
  mutate(MeasureName = gsub("-HRRP", "", MeasureName)) 

Pivoting the data wider

readmissionsClean <- FY_2024_Hospital_Readmissions_Reduction_Program %>%
  pivot_wider(
    names_from = MeasureName, 
    values_from = c(NumberOfDischarges, ExcessReadmissionRatio, PredictedReadmissionRate, ExpectedReadmissionRate, NumberOfReadmissions), 
    id_cols = c(FacilityName, FacilityId, State, StartDate, EndDate)
  )

# Check the new dataframe
dim(readmissionsClean)
## [1] 3129   35
head(readmissionsClean)
## # A tibble: 6 × 35
##   FacilityName         FacilityId State StartDate EndDate NumberOfDischarges_H…¹
##   <chr>                <chr>      <chr> <chr>     <chr>                    <dbl>
## 1 SOUTHEAST HEALTH ME… 010001     AL    07/01/20… 06/30/…                     NA
## 2 MARSHALL MEDICAL CE… 010005     AL    07/01/20… 06/30/…                     NA
## 3 NORTH ALABAMA MEDIC… 010006     AL    07/01/20… 06/30/…                     NA
## 4 MIZELL MEMORIAL HOS… 010007     AL    07/01/20… 06/30/…                     NA
## 5 CRENSHAW COMMUNITY … 010008     AL    07/01/20… 06/30/…                     NA
## 6 ST. VINCENT'S EAST   010011     AL    07/01/20… 06/30/…                     NA
## # ℹ abbreviated name: ¹​`NumberOfDischarges_HIP-KNEE`
## # ℹ 29 more variables: NumberOfDischarges_HF <dbl>,
## #   NumberOfDischarges_AMI <dbl>, NumberOfDischarges_PN <dbl>,
## #   NumberOfDischarges_CABG <dbl>, NumberOfDischarges_COPD <dbl>,
## #   `ExcessReadmissionRatio_HIP-KNEE` <dbl>, ExcessReadmissionRatio_HF <dbl>,
## #   ExcessReadmissionRatio_AMI <dbl>, ExcessReadmissionRatio_PN <dbl>,
## #   ExcessReadmissionRatio_CABG <dbl>, ExcessReadmissionRatio_COPD <dbl>, …

Filtering for only hip/knee conditions

readmissionsClean <- readmissionsClean %>%
  select(FacilityName, FacilityId, State, matches("HIP-KNEE$"))

Exploring and Preprocessing the HCAHPS dataset

Viewing and checking for missing values

# Display first 10 rows of HCAHPS 
head(HCAHPS,10)
## # A tibble: 10 × 22
##    FacilityId FacilityName           Address CityTown State ZipCode CountyParish
##    <chr>      <chr>                  <chr>   <chr>    <chr> <chr>   <chr>       
##  1 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  2 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  3 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  4 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  5 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  6 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  7 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  8 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  9 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
## 10 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
## # ℹ 15 more variables: TelephoneNumber <chr>, HcahpsMeasureId <chr>,
## #   HcahpsQuestion <chr>, HcahpsAnswerDescription <chr>,
## #   PatientSurveyStarRating <chr>, PatientSurveyStarRatingFootnote <dbl>,
## #   HcahpsAnswerPercent <chr>, HcahpsAnswerPercentFootnote <chr>,
## #   HcahpsLinearMeanValue <chr>, NumberOfCompletedSurveys <chr>,
## #   NumberOfCompletedSurveysFootnote <chr>, SurveyResponseRatePercent <chr>,
## #   SurveyResponseRatePercentFootnote <chr>, StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- HCAHPS %>%
  select_if(is.numeric)

# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
## PatientSurveyStarRatingFootnote 
##                          430641

Removing footnote columns and replacing NA values

# Removing all footnote columns
HCAHPS <- HCAHPS %>%
  select(-ends_with("footnote"))

# Replacing all "Not Applicable" with NA
HCAHPS <- as.data.frame(sapply(HCAHPS, function(x) {
  if (is.character(x)) {
    x[x == "Not Applicable"] <- NA
  }
  return(x)
}))

# Replacing all "Not Available" with NA
HCAHPS <- as.data.frame(sapply(HCAHPS, function(x) {
  if (is.character(x)) {
    x[x == "Not Available"] <- NA
  }
  return(x)
}))

Pivoting the data wider

HCAHPSClean <- HCAHPS %>%
  pivot_wider(
    names_from = HcahpsMeasureId, 
    values_from = c(PatientSurveyStarRating, HcahpsAnswerPercent, HcahpsLinearMeanValue, SurveyResponseRatePercent), 
    id_cols = c(FacilityName, FacilityId, State)
  )

# Check the new dataframe
dim(HCAHPSClean)
## [1] 4814  375
head(HCAHPSClean)
## # A tibble: 6 × 375
##   FacilityName    FacilityId State PatientSurveyStarRat…¹ PatientSurveyStarRat…²
##   <chr>           <chr>      <chr> <chr>                  <chr>                 
## 1 SOUTHEAST HEAL… 010001     AL    <NA>                   <NA>                  
## 2 MARSHALL MEDIC… 010005     AL    <NA>                   <NA>                  
## 3 NORTH ALABAMA … 010006     AL    <NA>                   <NA>                  
## 4 MIZELL MEMORIA… 010007     AL    <NA>                   <NA>                  
## 5 CRENSHAW COMMU… 010008     AL    <NA>                   <NA>                  
## 6 ST. VINCENT'S … 010011     AL    <NA>                   <NA>                  
## # ℹ abbreviated names: ¹​PatientSurveyStarRating_H_COMP_1_A_P,
## #   ²​PatientSurveyStarRating_H_COMP_1_SN_P
## # ℹ 370 more variables: PatientSurveyStarRating_H_COMP_1_U_P <chr>,
## #   PatientSurveyStarRating_H_COMP_1_LINEAR_SCORE <chr>,
## #   PatientSurveyStarRating_H_COMP_1_STAR_RATING <chr>,
## #   PatientSurveyStarRating_H_NURSE_RESPECT_A_P <chr>,
## #   PatientSurveyStarRating_H_NURSE_RESPECT_SN_P <chr>, …

Exploring and Preprocessing the Timely_and_Effective_Care dataset

Viewing and checking for missing values

# Display first 10 rows of Timely_and_Effective_Care
head(Timely_and_Effective_Care,10)
## # A tibble: 10 × 16
##    FacilityId FacilityName           Address CityTown State ZipCode CountyParish
##    <chr>      <chr>                  <chr>   <chr>    <chr> <chr>   <chr>       
##  1 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  2 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  3 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  4 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  5 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  6 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  7 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  8 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  9 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
## 10 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
## # ℹ 9 more variables: TelephoneNumber <chr>, Condition <chr>, MeasureId <chr>,
## #   MeasureName <chr>, Score <chr>, Sample <chr>, Footnote <chr>,
## #   StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- Timely_and_Effective_Care %>%
  select_if(is.numeric)

# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
## named list()

Replacing NA values

# Replacing all "Not Applicable" with NA
Timely_and_Effective_Care <- as.data.frame(sapply(Timely_and_Effective_Care, function(x) {
  if (is.character(x)) {
    x[x == "Not Applicable"] <- NA
  }
  return(x)
}))

# Replacing all "Not Available" with NA
Timely_and_Effective_Care <- as.data.frame(sapply(Timely_and_Effective_Care, function(x) {
  if (is.character(x)) {
    x[x == "Not Available"] <- NA
  }
  return(x)
}))

Pivoting the data wider

careClean <- Timely_and_Effective_Care %>%
  pivot_wider(
    names_from = MeasureId, 
    values_from = c(Score), 
    id_cols = c(FacilityName, FacilityId, State)
  )

# Check the new dataframe
dim(careClean)
## [1] 4677   26
head(careClean)
## # A tibble: 6 × 26
##   FacilityName   FacilityId State EDV   ED_2_Strata_1 ED_2_Strata_2 HCP_COVID_19
##   <chr>          <chr>      <chr> <chr> <chr>         <chr>         <chr>       
## 1 SOUTHEAST HEA… 010001     AL    high  <NA>          <NA>          80.7        
## 2 MARSHALL MEDI… 010005     AL    high  148           105           79.8        
## 3 NORTH ALABAMA… 010006     AL    high  <NA>          <NA>          79          
## 4 MIZELL MEMORI… 010007     AL    low   <NA>          <NA>          57.9        
## 5 CRENSHAW COMM… 010008     AL    low   <NA>          <NA>          81.2        
## 6 ST. VINCENT'S… 010011     AL    high  <NA>          <NA>          88          
## # ℹ 19 more variables: IMM_3 <chr>, OP_18b <chr>, OP_18c <chr>, OP_22 <chr>,
## #   OP_23 <chr>, OP_29 <chr>, OP_31 <chr>, SAFE_USE_OF_OPIOIDS <chr>,
## #   SEP_1 <chr>, SEP_SH_3HR <chr>, SEP_SH_6HR <chr>, SEV_SEP_3HR <chr>,
## #   SEV_SEP_6HR <chr>, STK_02 <chr>, STK_03 <chr>, STK_05 <chr>, STK_06 <chr>,
## #   VTE_1 <chr>, VTE_2 <chr>

Exploring and Preprocessing the Complications_and_Deaths dataset

Viewing and checking for missing values

# Display first 10 rows of Complications_and_Deaths
head(Complications_and_Deaths,10)
## # A tibble: 10 × 18
##    FacilityId FacilityName           Address CityTown State ZipCode CountyParish
##    <chr>      <chr>                  <chr>   <chr>    <chr> <chr>   <chr>       
##  1 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  2 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  3 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  4 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  5 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  6 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  7 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  8 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  9 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
## 10 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
## # ℹ 11 more variables: TelephoneNumber <chr>, MeasureId <chr>,
## #   MeasureName <chr>, ComparedToNational <chr>, Denominator <chr>,
## #   Score <chr>, LowerEstimate <chr>, HigherEstimate <chr>, Footnote <chr>,
## #   StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- Complications_and_Deaths %>%
  select_if(is.numeric)

# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
## named list()

Replacing NA values

# Replacing all "Not Applicable" with NA
Complications_and_Deaths <- as.data.frame(sapply(Complications_and_Deaths, function(x) {
  if (is.character(x)) {
    x[x == "Not Applicable"] <- NA
  }
  return(x)
}))

# Replacing all "Not Available" with NA
Complications_and_Deaths <- as.data.frame(sapply(Complications_and_Deaths, function(x) {
  if (is.character(x)) {
    x[x == "Not Available"] <- NA
  }
  return(x)
}))

Pivoting the data wider

deathsClean <- Complications_and_Deaths %>%
  pivot_wider(
    names_from = MeasureId, 
    values_from = c(ComparedToNational, Score), 
    id_cols = c(FacilityName, FacilityId, State)
  )

# Check the new dataframe
dim(deathsClean)
## [1] 4814   41
head(deathsClean)
## # A tibble: 6 × 41
##   FacilityName    FacilityId State ComparedToNational_C…¹ ComparedToNational_M…²
##   <chr>           <chr>      <chr> <chr>                  <chr>                 
## 1 SOUTHEAST HEAL… 010001     AL    No Different Than the… No Different Than the…
## 2 MARSHALL MEDIC… 010005     AL    No Different Than the… No Different Than the…
## 3 NORTH ALABAMA … 010006     AL    No Different Than the… Worse Than the Nation…
## 4 MIZELL MEMORIA… 010007     AL    Number of Cases Too S… Number of Cases Too S…
## 5 CRENSHAW COMMU… 010008     AL    <NA>                   Number of Cases Too S…
## 6 ST. VINCENT'S … 010011     AL    No Different Than the… No Different Than the…
## # ℹ abbreviated names: ¹​ComparedToNational_COMP_HIP_KNEE,
## #   ²​ComparedToNational_MORT_30_AMI
## # ℹ 36 more variables: ComparedToNational_MORT_30_CABG <chr>,
## #   ComparedToNational_MORT_30_COPD <chr>, ComparedToNational_MORT_30_HF <chr>,
## #   ComparedToNational_MORT_30_PN <chr>, ComparedToNational_MORT_30_STK <chr>,
## #   ComparedToNational_PSI_03 <chr>, ComparedToNational_PSI_04 <chr>,
## #   ComparedToNational_PSI_06 <chr>, ComparedToNational_PSI_08 <chr>, …

Exploring and Preprocessing the Payment_and_Value_of_Care dataset

Viewing and checking for missing values

# Display first 10 rows of Payment_and_Value_of_Care
head(Payment_and_Value_of_Care,10)
## # A tibble: 10 × 22
##    FacilityId FacilityName           Address CityTown State ZipCode CountyParish
##    <chr>      <chr>                  <chr>   <chr>    <chr> <chr>   <chr>       
##  1 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  2 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  3 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  4 010001     SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN   AL    36301   HOUSTON     
##  5 010005     MARSHALL MEDICAL CENT… 2505 U… BOAZ     AL    35957   MARSHALL    
##  6 010005     MARSHALL MEDICAL CENT… 2505 U… BOAZ     AL    35957   MARSHALL    
##  7 010005     MARSHALL MEDICAL CENT… 2505 U… BOAZ     AL    35957   MARSHALL    
##  8 010005     MARSHALL MEDICAL CENT… 2505 U… BOAZ     AL    35957   MARSHALL    
##  9 010006     NORTH ALABAMA MEDICAL… 1701 V… FLORENCE AL    35630   LAUDERDALE  
## 10 010006     NORTH ALABAMA MEDICAL… 1701 V… FLORENCE AL    35630   LAUDERDALE  
## # ℹ 15 more variables: TelephoneNumber <chr>, PaymentMeasureId <chr>,
## #   PaymentMeasureName <chr>, PaymentCategory <chr>, Denominator <chr>,
## #   Payment <chr>, LowerEstimate <chr>, HigherEstimate <chr>,
## #   PaymentFootnote <dbl>, ValueOfCareDisplayId <chr>,
## #   ValueOfCareDisplayName <chr>, ValueOfCareCategory <chr>,
## #   ValueOfCareFootnote <dbl>, StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- Payment_and_Value_of_Care %>%
  select_if(is.numeric)

# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
##     PaymentFootnote ValueOfCareFootnote 
##                9956               10044

Replacing NA values

# Replacing all "Not Applicable" with NA
Payment_and_Value_of_Care <- as.data.frame(sapply(Payment_and_Value_of_Care, function(x) {
  if (is.character(x)) {
    x[x == "Not Applicable"] <- NA
  }
  return(x)
}))

# Replacing all "Not Available" with NA
Payment_and_Value_of_Care <- as.data.frame(sapply(Payment_and_Value_of_Care, function(x) {
  if (is.character(x)) {
    x[x == "Not Available"] <- NA
  }
  return(x)
}))

Pivoting the data wider

paymentClean <- Payment_and_Value_of_Care %>%
  pivot_wider(
    names_from = PaymentMeasureId, 
    values_from = c(PaymentCategory, Payment), 
    id_cols = c(FacilityName, FacilityId, State)
  )

# Check the new dataframe
dim(paymentClean)
## [1] 4645   11
head(paymentClean)
## # A tibble: 6 × 11
##   FacilityName    FacilityId State PaymentCategory_PAYM…¹ PaymentCategory_PAYM…²
##   <chr>           <chr>      <chr> <chr>                  <chr>                 
## 1 SOUTHEAST HEAL… 010001     AL    No Different Than the… No Different Than the…
## 2 MARSHALL MEDIC… 010005     AL    No Different Than the… No Different Than the…
## 3 NORTH ALABAMA … 010006     AL    Greater Than the Nati… No Different Than the…
## 4 MIZELL MEMORIA… 010007     AL    Number of Cases Too S… No Different Than the…
## 5 CRENSHAW COMMU… 010008     AL    Number of Cases Too S… Number of Cases Too S…
## 6 ST. VINCENT'S … 010011     AL    No Different Than the… No Different Than the…
## # ℹ abbreviated names: ¹​PaymentCategory_PAYM_30_AMI,
## #   ²​PaymentCategory_PAYM_30_HF
## # ℹ 6 more variables: PaymentCategory_PAYM_30_PN <chr>,
## #   PaymentCategory_PAYM_90_HIP_KNEE <chr>, Payment_PAYM_30_AMI <chr>,
## #   Payment_PAYM_30_HF <chr>, Payment_PAYM_30_PN <chr>,
## #   Payment_PAYM_90_HIP_KNEE <chr>

Joining and cleaning the datasets

Joining the datasets based on FacilityId

HipKneeCleanTest <- readmissionsClean %>%
  full_join(HCAHPSClean, by = "FacilityId") %>%
  full_join(careClean, by = "FacilityId") %>%
  full_join(deathsClean, by = "FacilityId") %>%
  full_join(paymentClean, by = "FacilityId")

head(HipKneeCleanTest)
## # A tibble: 6 × 451
##   FacilityName.x                  FacilityId State.x NumberOfDischarges_HIP-KN…¹
##   <chr>                           <chr>      <chr>                         <dbl>
## 1 SOUTHEAST HEALTH MEDICAL CENTER 010001     AL                               NA
## 2 MARSHALL MEDICAL CENTERS        010005     AL                               NA
## 3 NORTH ALABAMA MEDICAL CENTER    010006     AL                               NA
## 4 MIZELL MEMORIAL HOSPITAL        010007     AL                               NA
## 5 CRENSHAW COMMUNITY HOSPITAL     010008     AL                               NA
## 6 ST. VINCENT'S EAST              010011     AL                               NA
## # ℹ abbreviated name: ¹​`NumberOfDischarges_HIP-KNEE`
## # ℹ 447 more variables: `ExcessReadmissionRatio_HIP-KNEE` <dbl>,
## #   `PredictedReadmissionRate_HIP-KNEE` <dbl>,
## #   `ExpectedReadmissionRate_HIP-KNEE` <dbl>,
## #   `NumberOfReadmissions_HIP-KNEE` <dbl>, FacilityName.y <chr>, State.y <chr>,
## #   PatientSurveyStarRating_H_COMP_1_A_P <chr>,
## #   PatientSurveyStarRating_H_COMP_1_SN_P <chr>, …

Removing redundant columns

# Removing duplicate columns
HipKneeCleanTest <- HipKneeCleanTest %>%
  select(-matches("\\.(x|y|z|w|v)$"))

Checking for NA Values

# Checking the dimensions
dim(HipKneeCleanTest)

# Count NA values in each column
na_counts <- sapply(HipKneeCleanTest, function(x) sum(is.na(x)))

# View the NA counts
print(na_counts)

Removing columns with more than 80% NA values

# Calculate the percentage of NA values for each column
na_percentage <- sapply(HipKneeCleanTest, function(x) mean(is.na(x)))

# Remove columns where more than 80% of the values are NA
HipKneeCleanTest <- HipKneeCleanTest[, na_percentage <= 0.8]

# Count NA values in each column
na_counts <- sapply(HipKneeCleanTest, function(x) sum(is.na(x)))

# View the NA counts
print(na_counts)

# Check the dimensions
dim(HipKneeCleanTest)

Removing answer percent and survey response percent columns

# Remove columns containing 'AnswerPercent' or 'SurveyResponseRate'
HipKneeCleanTest <- HipKneeCleanTest %>%
  select(-matches("AnswerPercent|SurveyResponseRate"))

# Check the dimensions
dim(HipKneeCleanTest)
## [1] 4816   87

Removing compared to national columns

# Remove columns containing 'ComparedToNational' and 'PaymentCategory'
HipKneeCleanTest <- HipKneeCleanTest %>%
  select(-matches("ComparedToNational|PaymentCategory"))

# Check the dimensions
dim(HipKneeCleanTest)
## [1] 4816   67

Checking data structure

str(HipKneeCleanTest)
## tibble [4,816 × 67] (S3: tbl_df/tbl/data.frame)
##  $ FacilityId                                      : chr [1:4816] "010001" "010005" "010006" "010007" ...
##  $ ExcessReadmissionRatio_HIP-KNEE                 : num [1:4816] 0.892 0.798 1.247 0.992 NA ...
##  $ PredictedReadmissionRate_HIP-KNEE               : num [1:4816] 3.53 3.76 5.52 4.34 NA ...
##  $ ExpectedReadmissionRate_HIP-KNEE                : num [1:4816] 3.96 4.72 4.43 4.37 NA ...
##  $ NumberOfReadmissions_HIP-KNEE                   : num [1:4816] 8 1 1 10 NA 2 4 6 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_1_STAR_RATING    : chr [1:4816] "3" "3" "2" "3" ...
##  $ PatientSurveyStarRating_H_COMP_2_STAR_RATING    : chr [1:4816] "4" "4" "3" "5" ...
##  $ PatientSurveyStarRating_H_COMP_3_STAR_RATING    : chr [1:4816] "3" "2" "2" "4" ...
##  $ PatientSurveyStarRating_H_COMP_5_STAR_RATING    : chr [1:4816] "3" "3" "2" "3" ...
##  $ PatientSurveyStarRating_H_COMP_6_STAR_RATING    : chr [1:4816] "4" "3" "3" "4" ...
##  $ PatientSurveyStarRating_H_COMP_7_STAR_RATING    : chr [1:4816] "4" "3" "2" "4" ...
##  $ PatientSurveyStarRating_H_CLEAN_STAR_RATING     : chr [1:4816] "3" "2" "1" "2" ...
##  $ PatientSurveyStarRating_H_QUIET_STAR_RATING     : chr [1:4816] "4" "4" "4" "4" ...
##  $ PatientSurveyStarRating_H_HSP_RATING_STAR_RATING: chr [1:4816] "4" "3" "2" "4" ...
##  $ PatientSurveyStarRating_H_RECMND_STAR_RATING    : chr [1:4816] "4" "3" "2" "4" ...
##  $ PatientSurveyStarRating_H_STAR_RATING           : chr [1:4816] "4" "3" "2" "4" ...
##  $ HcahpsLinearMeanValue_H_COMP_1_LINEAR_SCORE     : chr [1:4816] "89" "90" "88" "91" ...
##  $ HcahpsLinearMeanValue_H_COMP_2_LINEAR_SCORE     : chr [1:4816] "91" "92" "89" "95" ...
##  $ HcahpsLinearMeanValue_H_COMP_3_LINEAR_SCORE     : chr [1:4816] "81" "75" "75" "88" ...
##  $ HcahpsLinearMeanValue_H_COMP_5_LINEAR_SCORE     : chr [1:4816] "77" "76" "71" "77" ...
##  $ HcahpsLinearMeanValue_H_COMP_6_LINEAR_SCORE     : chr [1:4816] "87" "86" "83" "87" ...
##  $ HcahpsLinearMeanValue_H_COMP_7_LINEAR_SCORE     : chr [1:4816] "82" "79" "77" "82" ...
##  $ HcahpsLinearMeanValue_H_CLEAN_LINEAR_SCORE      : chr [1:4816] "84" "80" "74" "80" ...
##  $ HcahpsLinearMeanValue_H_QUIET_LINEAR_SCORE      : chr [1:4816] "86" "85" "85" "87" ...
##  $ HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE : chr [1:4816] "89" "85" "82" "89" ...
##  $ HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE     : chr [1:4816] "90" "83" "79" "88" ...
##  $ EDV                                             : chr [1:4816] "high" "high" "high" "low" ...
##  $ ED_2_Strata_1                                   : chr [1:4816] NA "148" NA NA ...
##  $ HCP_COVID_19                                    : chr [1:4816] "80.7" "79.8" "79" "57.9" ...
##  $ IMM_3                                           : chr [1:4816] "95" "80" "67" "53" ...
##  $ OP_18b                                          : chr [1:4816] "215" "147" "177" "130" ...
##  $ OP_18c                                          : chr [1:4816] "317" "266" NA "216" ...
##  $ OP_22                                           : chr [1:4816] "5" "3" "1" "4" ...
##  $ OP_23                                           : chr [1:4816] NA NA "69" NA ...
##  $ OP_29                                           : chr [1:4816] "47" "96" "85" "23" ...
##  $ SAFE_USE_OF_OPIOIDS                             : chr [1:4816] "14" "19" "17" NA ...
##  $ SEP_1                                           : chr [1:4816] "66" "74" "56" "86" ...
##  $ SEP_SH_3HR                                      : chr [1:4816] "70" "88" "77" NA ...
##  $ SEP_SH_6HR                                      : chr [1:4816] "100" "91" "81" NA ...
##  $ SEV_SEP_3HR                                     : chr [1:4816] "79" "88" "78" "89" ...
##  $ SEV_SEP_6HR                                     : chr [1:4816] "95" "96" "86" "97" ...
##  $ STK_02                                          : chr [1:4816] "98" "100" "96" NA ...
##  $ STK_05                                          : chr [1:4816] NA "91" NA NA ...
##  $ STK_06                                          : chr [1:4816] NA NA "97" NA ...
##  $ VTE_1                                           : chr [1:4816] "98" NA NA NA ...
##  $ VTE_2                                           : chr [1:4816] "99" NA "97" NA ...
##  $ Score_COMP_HIP_KNEE                             : chr [1:4816] "2.7" "2.3" "4.6" NA ...
##  $ Score_MORT_30_AMI                               : chr [1:4816] "12" "13.6" "16.5" NA ...
##  $ Score_MORT_30_COPD                              : chr [1:4816] "8.8" "9.9" "9.9" "13.7" ...
##  $ Score_MORT_30_HF                                : chr [1:4816] "8.9" "14.9" "12.5" "12.5" ...
##  $ Score_MORT_30_PN                                : chr [1:4816] "18" "23.3" "19.5" "28.5" ...
##  $ Score_MORT_30_STK                               : chr [1:4816] "14.8" "15.3" "17.2" NA ...
##  $ Score_PSI_03                                    : chr [1:4816] "0.39" "0.94" "1.39" "0.42" ...
##  $ Score_PSI_04                                    : chr [1:4816] "184.68" "183.49" "173.63" NA ...
##  $ Score_PSI_06                                    : chr [1:4816] "0.23" "0.22" "0.36" "0.24" ...
##  $ Score_PSI_08                                    : chr [1:4816] "0.10" "0.09" "0.08" "0.09" ...
##  $ Score_PSI_09                                    : chr [1:4816] "2.39" "2.69" "5.43" "2.49" ...
##  $ Score_PSI_10                                    : chr [1:4816] "1.14" "1.37" "1.26" "1.57" ...
##  $ Score_PSI_11                                    : chr [1:4816] "13.83" "7.19" "7.37" "8.45" ...
##  $ Score_PSI_12                                    : chr [1:4816] "4.49" "3.01" "3.36" "3.89" ...
##  $ Score_PSI_13                                    : chr [1:4816] "8.05" "4.46" "4.37" "5.19" ...
##  $ Score_PSI_14                                    : chr [1:4816] "1.69" "1.87" "1.76" NA ...
##  $ Score_PSI_15                                    : chr [1:4816] "0.93" "0.91" "1.34" "1.08" ...
##  $ Score_PSI_90                                    : chr [1:4816] "1.21" "0.97" "1.17" "0.95" ...
##  $ FacilityName                                    : chr [1:4816] "SOUTHEAST HEALTH MEDICAL CENTER" "MARSHALL MEDICAL CENTERS" "NORTH ALABAMA MEDICAL CENTER" "MIZELL MEMORIAL HOSPITAL" ...
##  $ State                                           : chr [1:4816] "AL" "AL" "AL" "AL" ...
##  $ Payment_PAYM_90_HIP_KNEE                        : chr [1:4816] "$22,212" "$18,030" "$21,898" NA ...
# Convert columns to numeric
HipKneeCleanTest <- HipKneeCleanTest %>%
  mutate_at(vars(starts_with("PatientSurveyStarRating_"), 
                 starts_with("HcahpsLinearMeanValue_"), 
                 starts_with("Score_"),
                 starts_with("ED_"),
                 starts_with("IMM_"),
                 starts_with("OP_"),
                 starts_with("SEP_"),
                 starts_with("SEV_"),
                 starts_with("STK_"),
                 starts_with("VTE_"),
                 starts_with("SAFE_"),
                 starts_with("HCP_")),
            ~ as.numeric(as.character(.)))

# View the structure
str(HipKneeCleanTest)
## tibble [4,816 × 67] (S3: tbl_df/tbl/data.frame)
##  $ FacilityId                                      : chr [1:4816] "010001" "010005" "010006" "010007" ...
##  $ ExcessReadmissionRatio_HIP-KNEE                 : num [1:4816] 0.892 0.798 1.247 0.992 NA ...
##  $ PredictedReadmissionRate_HIP-KNEE               : num [1:4816] 3.53 3.76 5.52 4.34 NA ...
##  $ ExpectedReadmissionRate_HIP-KNEE                : num [1:4816] 3.96 4.72 4.43 4.37 NA ...
##  $ NumberOfReadmissions_HIP-KNEE                   : num [1:4816] 8 1 1 10 NA 2 4 6 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_1_STAR_RATING    : num [1:4816] 3 3 2 3 NA 3 3 3 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_2_STAR_RATING    : num [1:4816] 4 4 3 5 NA 3 4 4 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_3_STAR_RATING    : num [1:4816] 3 2 2 4 NA 4 3 2 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_5_STAR_RATING    : num [1:4816] 3 3 2 3 NA 3 3 2 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_6_STAR_RATING    : num [1:4816] 4 3 3 4 NA 3 3 2 NA 3 ...
##  $ PatientSurveyStarRating_H_COMP_7_STAR_RATING    : num [1:4816] 4 3 2 4 NA 3 3 3 NA 4 ...
##  $ PatientSurveyStarRating_H_CLEAN_STAR_RATING     : num [1:4816] 3 2 1 2 NA 2 2 1 NA 4 ...
##  $ PatientSurveyStarRating_H_QUIET_STAR_RATING     : num [1:4816] 4 4 4 4 NA 4 4 3 NA 5 ...
##  $ PatientSurveyStarRating_H_HSP_RATING_STAR_RATING: num [1:4816] 4 3 2 4 NA 3 2 3 NA 4 ...
##  $ PatientSurveyStarRating_H_RECMND_STAR_RATING    : num [1:4816] 4 3 2 4 NA 4 2 3 NA 4 ...
##  $ PatientSurveyStarRating_H_STAR_RATING           : num [1:4816] 4 3 2 4 NA 3 3 3 NA 4 ...
##  $ HcahpsLinearMeanValue_H_COMP_1_LINEAR_SCORE     : num [1:4816] 89 90 88 91 NA 90 91 89 NA 92 ...
##  $ HcahpsLinearMeanValue_H_COMP_2_LINEAR_SCORE     : num [1:4816] 91 92 89 95 NA 90 91 91 NA 92 ...
##  $ HcahpsLinearMeanValue_H_COMP_3_LINEAR_SCORE     : num [1:4816] 81 75 75 88 NA 85 80 78 NA 85 ...
##  $ HcahpsLinearMeanValue_H_COMP_5_LINEAR_SCORE     : num [1:4816] 77 76 71 77 NA 76 76 72 NA 78 ...
##  $ HcahpsLinearMeanValue_H_COMP_6_LINEAR_SCORE     : num [1:4816] 87 86 83 87 NA 86 86 81 NA 86 ...
##  $ HcahpsLinearMeanValue_H_COMP_7_LINEAR_SCORE     : num [1:4816] 82 79 77 82 NA 81 79 80 NA 83 ...
##  $ HcahpsLinearMeanValue_H_CLEAN_LINEAR_SCORE      : num [1:4816] 84 80 74 80 NA 81 83 78 NA 88 ...
##  $ HcahpsLinearMeanValue_H_QUIET_LINEAR_SCORE      : num [1:4816] 86 85 85 87 NA 84 84 82 NA 89 ...
##  $ HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE : num [1:4816] 89 85 82 89 NA 88 83 85 NA 90 ...
##  $ HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE     : num [1:4816] 90 83 79 88 NA 87 80 84 NA 91 ...
##  $ EDV                                             : chr [1:4816] "high" "high" "high" "low" ...
##  $ ED_2_Strata_1                                   : num [1:4816] NA 148 NA NA NA NA NA NA NA NA ...
##  $ HCP_COVID_19                                    : num [1:4816] 80.7 79.8 79 57.9 81.2 88 69.8 87.3 95.9 85.3 ...
##  $ IMM_3                                           : num [1:4816] 95 80 67 53 45 81 65 93 98 81 ...
##  $ OP_18b                                          : num [1:4816] 215 147 177 130 118 206 160 185 102 145 ...
##  $ OP_18c                                          : num [1:4816] 317 266 NA 216 98 124 220 220 NA 324 ...
##  $ OP_22                                           : num [1:4816] 5 3 1 4 0 5 4 3 0 2 ...
##  $ OP_23                                           : num [1:4816] NA NA 69 NA NA 47 NA 73 NA 35 ...
##  $ OP_29                                           : num [1:4816] 47 96 85 23 67 100 100 NA NA 82 ...
##  $ SAFE_USE_OF_OPIOIDS                             : num [1:4816] 14 19 17 NA NA 20 14 23 NA 17 ...
##  $ SEP_1                                           : num [1:4816] 66 74 56 86 NA 51 92 77 NA 87 ...
##  $ SEP_SH_3HR                                      : num [1:4816] 70 88 77 NA NA 78 94 83 NA 90 ...
##  $ SEP_SH_6HR                                      : num [1:4816] 100 91 81 NA NA 81 83 100 NA 94 ...
##  $ SEV_SEP_3HR                                     : num [1:4816] 79 88 78 89 NA 69 95 85 NA 94 ...
##  $ SEV_SEP_6HR                                     : num [1:4816] 95 96 86 97 NA 91 99 97 NA 99 ...
##  $ STK_02                                          : num [1:4816] 98 100 96 NA NA 93 NA 99 NA NA ...
##  $ STK_05                                          : num [1:4816] NA 91 NA NA NA NA NA NA NA NA ...
##  $ STK_06                                          : num [1:4816] NA NA 97 NA NA NA NA NA NA NA ...
##  $ VTE_1                                           : num [1:4816] 98 NA NA NA NA 79 89 84 44 59 ...
##  $ VTE_2                                           : num [1:4816] 99 NA 97 NA NA 88 93 94 NA NA ...
##  $ Score_COMP_HIP_KNEE                             : num [1:4816] 2.7 2.3 4.6 NA NA 3.5 3.8 3.5 NA 4.3 ...
##  $ Score_MORT_30_AMI                               : num [1:4816] 12 13.6 16.5 NA NA 13.2 13.8 13.1 NA NA ...
##  $ Score_MORT_30_COPD                              : num [1:4816] 8.8 9.9 9.9 13.7 NA 10.3 NA 9.2 NA 7.8 ...
##  $ Score_MORT_30_HF                                : num [1:4816] 8.9 14.9 12.5 12.5 NA 13.5 13.6 9.9 NA 16.9 ...
##  $ Score_MORT_30_PN                                : num [1:4816] 18 23.3 19.5 28.5 NA 20.9 22 17.2 NA 26.1 ...
##  $ Score_MORT_30_STK                               : num [1:4816] 14.8 15.3 17.2 NA NA 12.3 NA 13.2 NA 17.3 ...
##  $ Score_PSI_03                                    : num [1:4816] 0.39 0.94 1.39 0.42 0.54 0.13 0.41 0.63 0.57 0.47 ...
##  $ Score_PSI_04                                    : num [1:4816] 185 183 174 NA NA ...
##  $ Score_PSI_06                                    : num [1:4816] 0.23 0.22 0.36 0.24 0.25 0.24 0.24 0.21 0.25 0.22 ...
##  $ Score_PSI_08                                    : num [1:4816] 0.1 0.09 0.08 0.09 0.09 0.08 0.09 0.09 0.09 0.09 ...
##  $ Score_PSI_09                                    : num [1:4816] 2.39 2.69 5.43 2.49 NA 1.88 2.44 3.29 2.44 2.58 ...
##  $ Score_PSI_10                                    : num [1:4816] 1.14 1.37 1.26 1.57 NA 1.72 1.51 1.2 1.57 NA ...
##  $ Score_PSI_11                                    : num [1:4816] 13.83 7.19 7.37 8.45 NA ...
##  $ Score_PSI_12                                    : num [1:4816] 4.49 3.01 3.36 3.89 NA 3.04 3.32 3.67 3.56 5.63 ...
##  $ Score_PSI_13                                    : num [1:4816] 8.05 4.46 4.37 5.19 NA 5.55 4.88 6.08 5.18 NA ...
##  $ Score_PSI_14                                    : num [1:4816] 1.69 1.87 1.76 NA NA 1.86 2.46 2.77 NA 1.83 ...
##  $ Score_PSI_15                                    : num [1:4816] 0.93 0.91 1.34 1.08 NA 1.18 1.04 0.84 NA 0.88 ...
##  $ Score_PSI_90                                    : num [1:4816] 1.21 0.97 1.17 0.95 NA 0.72 0.89 1.17 0.98 1.05 ...
##  $ FacilityName                                    : chr [1:4816] "SOUTHEAST HEALTH MEDICAL CENTER" "MARSHALL MEDICAL CENTERS" "NORTH ALABAMA MEDICAL CENTER" "MIZELL MEMORIAL HOSPITAL" ...
##  $ State                                           : chr [1:4816] "AL" "AL" "AL" "AL" ...
##  $ Payment_PAYM_90_HIP_KNEE                        : chr [1:4816] "$22,212" "$18,030" "$21,898" NA ...

Fixing the payment column

# Remove $ and , and convert to numeric
HipKneeCleanTest <- HipKneeCleanTest %>%
  mutate_at(vars(starts_with("Payment_")), 
            ~ as.numeric(gsub("[\\$,]", "", .)))

# Checking the structure
str(HipKneeCleanTest)
## tibble [4,816 × 67] (S3: tbl_df/tbl/data.frame)
##  $ FacilityId                                      : chr [1:4816] "010001" "010005" "010006" "010007" ...
##  $ ExcessReadmissionRatio_HIP-KNEE                 : num [1:4816] 0.892 0.798 1.247 0.992 NA ...
##  $ PredictedReadmissionRate_HIP-KNEE               : num [1:4816] 3.53 3.76 5.52 4.34 NA ...
##  $ ExpectedReadmissionRate_HIP-KNEE                : num [1:4816] 3.96 4.72 4.43 4.37 NA ...
##  $ NumberOfReadmissions_HIP-KNEE                   : num [1:4816] 8 1 1 10 NA 2 4 6 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_1_STAR_RATING    : num [1:4816] 3 3 2 3 NA 3 3 3 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_2_STAR_RATING    : num [1:4816] 4 4 3 5 NA 3 4 4 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_3_STAR_RATING    : num [1:4816] 3 2 2 4 NA 4 3 2 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_5_STAR_RATING    : num [1:4816] 3 3 2 3 NA 3 3 2 NA 4 ...
##  $ PatientSurveyStarRating_H_COMP_6_STAR_RATING    : num [1:4816] 4 3 3 4 NA 3 3 2 NA 3 ...
##  $ PatientSurveyStarRating_H_COMP_7_STAR_RATING    : num [1:4816] 4 3 2 4 NA 3 3 3 NA 4 ...
##  $ PatientSurveyStarRating_H_CLEAN_STAR_RATING     : num [1:4816] 3 2 1 2 NA 2 2 1 NA 4 ...
##  $ PatientSurveyStarRating_H_QUIET_STAR_RATING     : num [1:4816] 4 4 4 4 NA 4 4 3 NA 5 ...
##  $ PatientSurveyStarRating_H_HSP_RATING_STAR_RATING: num [1:4816] 4 3 2 4 NA 3 2 3 NA 4 ...
##  $ PatientSurveyStarRating_H_RECMND_STAR_RATING    : num [1:4816] 4 3 2 4 NA 4 2 3 NA 4 ...
##  $ PatientSurveyStarRating_H_STAR_RATING           : num [1:4816] 4 3 2 4 NA 3 3 3 NA 4 ...
##  $ HcahpsLinearMeanValue_H_COMP_1_LINEAR_SCORE     : num [1:4816] 89 90 88 91 NA 90 91 89 NA 92 ...
##  $ HcahpsLinearMeanValue_H_COMP_2_LINEAR_SCORE     : num [1:4816] 91 92 89 95 NA 90 91 91 NA 92 ...
##  $ HcahpsLinearMeanValue_H_COMP_3_LINEAR_SCORE     : num [1:4816] 81 75 75 88 NA 85 80 78 NA 85 ...
##  $ HcahpsLinearMeanValue_H_COMP_5_LINEAR_SCORE     : num [1:4816] 77 76 71 77 NA 76 76 72 NA 78 ...
##  $ HcahpsLinearMeanValue_H_COMP_6_LINEAR_SCORE     : num [1:4816] 87 86 83 87 NA 86 86 81 NA 86 ...
##  $ HcahpsLinearMeanValue_H_COMP_7_LINEAR_SCORE     : num [1:4816] 82 79 77 82 NA 81 79 80 NA 83 ...
##  $ HcahpsLinearMeanValue_H_CLEAN_LINEAR_SCORE      : num [1:4816] 84 80 74 80 NA 81 83 78 NA 88 ...
##  $ HcahpsLinearMeanValue_H_QUIET_LINEAR_SCORE      : num [1:4816] 86 85 85 87 NA 84 84 82 NA 89 ...
##  $ HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE : num [1:4816] 89 85 82 89 NA 88 83 85 NA 90 ...
##  $ HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE     : num [1:4816] 90 83 79 88 NA 87 80 84 NA 91 ...
##  $ EDV                                             : chr [1:4816] "high" "high" "high" "low" ...
##  $ ED_2_Strata_1                                   : num [1:4816] NA 148 NA NA NA NA NA NA NA NA ...
##  $ HCP_COVID_19                                    : num [1:4816] 80.7 79.8 79 57.9 81.2 88 69.8 87.3 95.9 85.3 ...
##  $ IMM_3                                           : num [1:4816] 95 80 67 53 45 81 65 93 98 81 ...
##  $ OP_18b                                          : num [1:4816] 215 147 177 130 118 206 160 185 102 145 ...
##  $ OP_18c                                          : num [1:4816] 317 266 NA 216 98 124 220 220 NA 324 ...
##  $ OP_22                                           : num [1:4816] 5 3 1 4 0 5 4 3 0 2 ...
##  $ OP_23                                           : num [1:4816] NA NA 69 NA NA 47 NA 73 NA 35 ...
##  $ OP_29                                           : num [1:4816] 47 96 85 23 67 100 100 NA NA 82 ...
##  $ SAFE_USE_OF_OPIOIDS                             : num [1:4816] 14 19 17 NA NA 20 14 23 NA 17 ...
##  $ SEP_1                                           : num [1:4816] 66 74 56 86 NA 51 92 77 NA 87 ...
##  $ SEP_SH_3HR                                      : num [1:4816] 70 88 77 NA NA 78 94 83 NA 90 ...
##  $ SEP_SH_6HR                                      : num [1:4816] 100 91 81 NA NA 81 83 100 NA 94 ...
##  $ SEV_SEP_3HR                                     : num [1:4816] 79 88 78 89 NA 69 95 85 NA 94 ...
##  $ SEV_SEP_6HR                                     : num [1:4816] 95 96 86 97 NA 91 99 97 NA 99 ...
##  $ STK_02                                          : num [1:4816] 98 100 96 NA NA 93 NA 99 NA NA ...
##  $ STK_05                                          : num [1:4816] NA 91 NA NA NA NA NA NA NA NA ...
##  $ STK_06                                          : num [1:4816] NA NA 97 NA NA NA NA NA NA NA ...
##  $ VTE_1                                           : num [1:4816] 98 NA NA NA NA 79 89 84 44 59 ...
##  $ VTE_2                                           : num [1:4816] 99 NA 97 NA NA 88 93 94 NA NA ...
##  $ Score_COMP_HIP_KNEE                             : num [1:4816] 2.7 2.3 4.6 NA NA 3.5 3.8 3.5 NA 4.3 ...
##  $ Score_MORT_30_AMI                               : num [1:4816] 12 13.6 16.5 NA NA 13.2 13.8 13.1 NA NA ...
##  $ Score_MORT_30_COPD                              : num [1:4816] 8.8 9.9 9.9 13.7 NA 10.3 NA 9.2 NA 7.8 ...
##  $ Score_MORT_30_HF                                : num [1:4816] 8.9 14.9 12.5 12.5 NA 13.5 13.6 9.9 NA 16.9 ...
##  $ Score_MORT_30_PN                                : num [1:4816] 18 23.3 19.5 28.5 NA 20.9 22 17.2 NA 26.1 ...
##  $ Score_MORT_30_STK                               : num [1:4816] 14.8 15.3 17.2 NA NA 12.3 NA 13.2 NA 17.3 ...
##  $ Score_PSI_03                                    : num [1:4816] 0.39 0.94 1.39 0.42 0.54 0.13 0.41 0.63 0.57 0.47 ...
##  $ Score_PSI_04                                    : num [1:4816] 185 183 174 NA NA ...
##  $ Score_PSI_06                                    : num [1:4816] 0.23 0.22 0.36 0.24 0.25 0.24 0.24 0.21 0.25 0.22 ...
##  $ Score_PSI_08                                    : num [1:4816] 0.1 0.09 0.08 0.09 0.09 0.08 0.09 0.09 0.09 0.09 ...
##  $ Score_PSI_09                                    : num [1:4816] 2.39 2.69 5.43 2.49 NA 1.88 2.44 3.29 2.44 2.58 ...
##  $ Score_PSI_10                                    : num [1:4816] 1.14 1.37 1.26 1.57 NA 1.72 1.51 1.2 1.57 NA ...
##  $ Score_PSI_11                                    : num [1:4816] 13.83 7.19 7.37 8.45 NA ...
##  $ Score_PSI_12                                    : num [1:4816] 4.49 3.01 3.36 3.89 NA 3.04 3.32 3.67 3.56 5.63 ...
##  $ Score_PSI_13                                    : num [1:4816] 8.05 4.46 4.37 5.19 NA 5.55 4.88 6.08 5.18 NA ...
##  $ Score_PSI_14                                    : num [1:4816] 1.69 1.87 1.76 NA NA 1.86 2.46 2.77 NA 1.83 ...
##  $ Score_PSI_15                                    : num [1:4816] 0.93 0.91 1.34 1.08 NA 1.18 1.04 0.84 NA 0.88 ...
##  $ Score_PSI_90                                    : num [1:4816] 1.21 0.97 1.17 0.95 NA 0.72 0.89 1.17 0.98 1.05 ...
##  $ FacilityName                                    : chr [1:4816] "SOUTHEAST HEALTH MEDICAL CENTER" "MARSHALL MEDICAL CENTERS" "NORTH ALABAMA MEDICAL CENTER" "MIZELL MEMORIAL HOSPITAL" ...
##  $ State                                           : chr [1:4816] "AL" "AL" "AL" "AL" ...
##  $ Payment_PAYM_90_HIP_KNEE                        : num [1:4816] 22212 18030 21898 NA NA ...

Encoding categorical variables

Identify Categorical Variables

# Create function to find categorical variables
is_categorical <- function(x) is.factor(x) | is.character(x)

# Apply function to all variables in the dataset
categorical_vars <- sapply(HipKneeClean, is_categorical)

# Print the names of all categorical variables
categorical <- names(HipKneeClean)[categorical_vars]
categorical
## [1] "FacilityId"   "EDV"          "FacilityName" "State"

Dummy encode the EDV column

# Define the encoding mapping (ignore NAs for now)
encoding_map <- c(
  'low' = 1,
  'medium' = 2,
  'high' = 3,
  'very high' = 4
)
# Dummy encoding used due to ordinal nature of this data

# Create a copy of HipKneeCleanTest and name it HipKneeTest to separate cleaned dataset and the test dataset
HipKneeTest <- HipKneeCleanTest %>%
  mutate(EDV = recode(EDV, !!!encoding_map))

# Print first 20 rows of EDV column in HipKneeClean and HipKneeTrain to ensure proper encoding
cat("HipKneeCleanTest")
## HipKneeCleanTest
print(head(HipKneeCleanTest$EDV, 20))
##  [1] "high"      "high"      "high"      "low"       "low"       "high"     
##  [7] "low"       "medium"    "low"       "medium"    "low"       "low"      
## [13] "high"      "high"      "very high" "very high" "low"       "high"     
## [19] "low"       "very high"
cat("HipKneeTest")
## HipKneeTest
print(head(HipKneeTest$EDV, 20))
##  [1] 3 3 3 1 1 3 1 2 1 2 1 1 3 3 4 4 1 3 1 4

Encode each state in alphabetical order

# Manually map out each state with their respective code in alphabetical order with a preceding 0 to make data non-ordinal
state_mapping <- c(
  "AL" = "001",
  "AK" = "002",
  "AZ" = "003",
  "AR" = "004",
  "CA" = "005",
  "CO" = "006",
  "CT" = "007",
  "DE" = "008",
  "FL" = "009",
  "GA" = "010",
  "HI" = "011",
  "ID" = "012",
  "IL" = "013",
  "IN" = "014",
  "IA" = "015",
  "KS" = "016",
  "KY" = "017",
  "LA" = "018",
  "ME" = "019",
  "MD" = "020",
  "MA" = "021",
  "MI" = "022",
  "MN" = "023",
  "MS" = "024",
  "MO" = "025",
  "MT" = "026",
  "NE" = "027",
  "NV" = "028",
  "NH" = "029",
  "NJ" = "030",
  "NM" = "031",
  "NY" = "032",
  "NC" = "033",
  "ND" = "034",
  "OH" = "035",
  "OK" = "036",
  "OR" = "037",
  "PA" = "038",
  "RI" = "039",
  "SC" = "040",
  "SD" = "041",
  "TN" = "042",
  "TX" = "043",
  "UT" = "044",
  "VT" = "045",
  "VA" = "046",
  "WA" = "047",
  "WV" = "048",
  "WI" = "049",
  "WY" = "050"
)

# Create new "StateCode" column with the encoded values
HipKneeTest <- HipKneeTest %>%
  mutate(StateCode = state_mapping[State])

# Print 100 rows of the "State" and "StateCode" columns to ensure accuracy
print("State and StateCode Columns")
## [1] "State and StateCode Columns"
print(head(HipKneeTest[c("State", "StateCode")], 100))
## # A tibble: 100 × 2
##    State StateCode
##    <chr> <chr>    
##  1 AL    001      
##  2 AL    001      
##  3 AL    001      
##  4 AL    001      
##  5 AL    001      
##  6 AL    001      
##  7 AL    001      
##  8 AL    001      
##  9 AL    001      
## 10 AL    001      
## # ℹ 90 more rows
# Print all unique values in "StateCode" column to ensure accuracy
print("Unique StateCode Values")
## [1] "Unique StateCode Values"
print(unique(HipKneeTest$StateCode))
##  [1] "001" "002" "003" "004" "005" "006" "007" "008" NA    "009" "010" "011"
## [13] "012" "013" "014" "015" "016" "017" "018" "019" "020" "021" "022" "023"
## [25] "024" "025" "026" "027" "028" "029" "030" "031" "032" "033" "034" "035"
## [37] "036" "037" "038" "039" "040" "041" "042" "043" "044" "045" "046" "047"
## [49] "048" "049" "050"

Collinearity and Feature Removal

Remove correlated and unnecessary variables

# Specify columns to remove
columns_to_remove <- c(
  "ED_2_Strata_1",
  "OP_23",
  "VTE_2",
  "OP_18c",
  "OP_22",
  "STK_02",
  "STK_05",
  "STK_06",
  "HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE",
  "NumberOfReadmissions_HIP-KNEE",
  "ExcessReadmissionRatio_HIP-KNEE",
  "ExpectedReadmissionRate_HIP-KNEE",
  "SEP_1",
  "SEV_SEP_6HR",
  "SEV_SEP_3HR",
  "SEP_SH_6HR",
  "SEP_SH_3HR",
  "Score_PSI_90",
  "PatientSurveyStarRating_H_COMP_1_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_2_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_3_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_5_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_6_STAR_RATING",    
  "PatientSurveyStarRating_H_COMP_7_STAR_RATING",    
  "PatientSurveyStarRating_H_CLEAN_STAR_RATING",     
  "PatientSurveyStarRating_H_QUIET_STAR_RATING",     
  "PatientSurveyStarRating_H_HSP_RATING_STAR_RATING",
  "PatientSurveyStarRating_H_RECMND_STAR_RATING",    
  "PatientSurveyStarRating_H_STAR_RATING",
  "HcahpsLinearMeanValue_H_COMP_1_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_2_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_3_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_5_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_6_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_COMP_7_LINEAR_SCORE",    
  "HcahpsLinearMeanValue_H_CLEAN_LINEAR_SCORE",     
  "HcahpsLinearMeanValue_H_QUIET_LINEAR_SCORE"
)

# Remove specified columns
HipKneeTest <- HipKneeTest %>% select(-all_of(columns_to_remove))
# Print column names to verify
print("Remaining columns:")
## [1] "Remaining columns:"
print(colnames(HipKneeTest))
##  [1] "FacilityId"                                     
##  [2] "PredictedReadmissionRate_HIP-KNEE"              
##  [3] "HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE"
##  [4] "EDV"                                            
##  [5] "HCP_COVID_19"                                   
##  [6] "IMM_3"                                          
##  [7] "OP_18b"                                         
##  [8] "OP_29"                                          
##  [9] "SAFE_USE_OF_OPIOIDS"                            
## [10] "VTE_1"                                          
## [11] "Score_COMP_HIP_KNEE"                            
## [12] "Score_MORT_30_AMI"                              
## [13] "Score_MORT_30_COPD"                             
## [14] "Score_MORT_30_HF"                               
## [15] "Score_MORT_30_PN"                               
## [16] "Score_MORT_30_STK"                              
## [17] "Score_PSI_03"                                   
## [18] "Score_PSI_04"                                   
## [19] "Score_PSI_06"                                   
## [20] "Score_PSI_08"                                   
## [21] "Score_PSI_09"                                   
## [22] "Score_PSI_10"                                   
## [23] "Score_PSI_11"                                   
## [24] "Score_PSI_12"                                   
## [25] "Score_PSI_13"                                   
## [26] "Score_PSI_14"                                   
## [27] "Score_PSI_15"                                   
## [28] "FacilityName"                                   
## [29] "State"                                          
## [30] "Payment_PAYM_90_HIP_KNEE"                       
## [31] "StateCode"

Reassess collinearity with heatmap and correlation matrix

# Compute correlation matrix
cor_matrix <- cor(HipKneeTest %>% select_if(is.numeric), use = "pairwise.complete.obs")

# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)

# Plot heatmap
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Figure 5. Correlation Heatmap of Numeric Variables")

# Convert correlation matrix to df
cor_table <- as.data.frame(cor_matrix)

# Add variable names as a column
cor_table$Variable <- rownames(cor_table)

# Reorder columns
cor_table <- cor_table %>%
  select(Variable, everything())

# Print table
cor_table %>%
  kable(caption = "Table 8. Correlation Coefficients Table") %>%
  kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
Table 8. Correlation Coefficients Table
Variable PredictedReadmissionRate_HIP-KNEE HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE EDV HCP_COVID_19 IMM_3 OP_18b OP_29 SAFE_USE_OF_OPIOIDS VTE_1 Score_COMP_HIP_KNEE Score_MORT_30_AMI Score_MORT_30_COPD Score_MORT_30_HF Score_MORT_30_PN Score_MORT_30_STK Score_PSI_03 Score_PSI_04 Score_PSI_06 Score_PSI_08 Score_PSI_09 Score_PSI_10 Score_PSI_11 Score_PSI_12 Score_PSI_13 Score_PSI_14 Score_PSI_15 Payment_PAYM_90_HIP_KNEE
PredictedReadmissionRate_HIP-KNEE PredictedReadmissionRate_HIP-KNEE 1.0000000 -0.2060912 0.1986939 -0.0563082 -0.0028840 0.1295727 -0.0106510 0.1063002 0.0654668 0.3208550 0.0074065 -0.0794948 -0.1067828 -0.0985660 -0.0376746 -0.0037334 -0.0449077 0.0154891 -0.0214412 -0.0182303 0.0710046 0.1130121 0.1047402 0.1193336 0.0140012 -0.0158282 0.2975679
HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE -0.2060912 1.0000000 -0.2262341 0.0302154 0.2182505 -0.2448842 0.0865028 0.1100002 -0.0248375 -0.1091775 -0.0952111 -0.0230632 0.0300940 -0.0702915 -0.0760295 -0.0499063 -0.0948401 -0.0054742 -0.0590998 0.0707774 -0.0157615 -0.1615833 -0.0674588 -0.1458284 -0.0172231 0.0304634 -0.2108956
EDV EDV 0.1986939 -0.2262341 1.0000000 0.1599806 0.0038674 0.5918897 0.0603877 -0.1223889 0.2992859 -0.0240093 -0.0687401 -0.0840621 -0.2588739 -0.1904351 -0.0754281 0.0292819 -0.0438108 -0.0308989 -0.1637017 -0.0181523 0.0837980 0.0455190 0.0749218 0.0724936 0.0399935 0.0144400 0.0053946
HCP_COVID_19 HCP_COVID_19 -0.0563082 0.0302154 0.1599806 1.0000000 0.3203622 0.2574291 0.1067941 -0.0812735 0.0241622 -0.0510683 -0.0869890 -0.1128278 -0.1245435 -0.1523779 -0.0988833 0.0943953 0.0417007 0.0225916 -0.0272232 0.0549990 0.0009222 -0.0909811 0.1091949 -0.0160408 0.0169512 0.0430812 -0.0627505
IMM_3 IMM_3 -0.0028840 0.2182505 0.0038674 0.3203622 1.0000000 0.1105628 0.1317922 0.0410289 0.0906329 -0.0212916 -0.0165321 -0.0616051 -0.0010634 -0.0761104 0.0146397 0.0451508 0.0601831 0.0544576 -0.0311579 0.0899361 0.0412713 -0.0625676 0.0594933 -0.0250714 0.0723872 0.0625753 -0.0720431
OP_18b OP_18b 0.1295727 -0.2448842 0.5918897 0.2574291 0.1105628 1.0000000 0.0506067 -0.1400845 0.2344307 -0.0293698 -0.0678837 -0.1527290 -0.2195933 -0.1858291 -0.0905644 0.0583644 0.0638412 0.0187794 -0.0806516 0.0224833 0.0544528 -0.0076797 0.1653812 0.0993570 0.0676972 0.0374530 -0.0241965
OP_29 OP_29 -0.0106510 0.0865028 0.0603877 0.1067941 0.1317922 0.0506067 1.0000000 -0.0650231 0.1567526 -0.0096464 -0.0600569 0.0081252 0.0099705 -0.0536184 -0.0251654 -0.0032584 0.0312780 0.0059688 -0.0199006 0.0150699 0.0331569 -0.0837910 -0.0108546 -0.0087678 0.0163779 0.0419068 -0.0815209
SAFE_USE_OF_OPIOIDS SAFE_USE_OF_OPIOIDS 0.1063002 0.1100002 -0.1223889 -0.0812735 0.0410289 -0.1400845 -0.0650231 1.0000000 -0.0563373 -0.0081923 -0.0643353 -0.0362573 0.0171605 -0.0204107 -0.0804707 -0.0287158 -0.0995591 -0.0603629 -0.0017558 -0.0043396 -0.0382369 0.0137283 -0.0380145 -0.0258097 -0.0166146 -0.0268805 -0.0048449
VTE_1 VTE_1 0.0654668 -0.0248375 0.2992859 0.0241622 0.0906329 0.2344307 0.1567526 -0.0563373 1.0000000 -0.0526925 -0.0493931 -0.0282911 -0.1051171 -0.1710235 -0.1238916 -0.0378500 -0.1180160 -0.0262166 -0.0522876 -0.0577213 -0.0037803 -0.0301775 -0.0317534 -0.0440578 0.0086141 0.0326658 -0.1256363
Score_COMP_HIP_KNEE Score_COMP_HIP_KNEE 0.3208550 -0.1091775 -0.0240093 -0.0510683 -0.0212916 -0.0293698 -0.0096464 -0.0081923 -0.0526925 1.0000000 0.0830479 -0.0203930 -0.0007242 0.0241066 0.0211621 0.0498557 0.0038509 0.0505415 0.0577776 0.0540124 0.0813038 0.1279724 0.1458258 0.1334619 0.0498603 0.0433809 0.3410864
Score_MORT_30_AMI Score_MORT_30_AMI 0.0074065 -0.0952111 -0.0687401 -0.0869890 -0.0165321 -0.0678837 -0.0600569 -0.0643353 -0.0493931 0.0830479 1.0000000 0.2498600 0.3407616 0.3309425 0.2222539 0.0415523 0.2105379 0.0885083 0.1010348 0.0889343 0.1066619 0.1037006 0.0492328 0.0467554 0.0454462 0.0297688 0.0591548
Score_MORT_30_COPD Score_MORT_30_COPD -0.0794948 -0.0230632 -0.0840621 -0.1128278 -0.0616051 -0.1527290 0.0081252 -0.0362573 -0.0282911 -0.0203930 0.2498600 1.0000000 0.3844105 0.3710744 0.2038243 -0.0069743 0.1713379 0.0478268 0.0397571 0.0429090 0.0320669 0.0426574 -0.0532586 0.0026944 0.0734846 0.0340007 -0.0406696
Score_MORT_30_HF Score_MORT_30_HF -0.1067828 0.0300940 -0.2588739 -0.1245435 -0.0010634 -0.2195933 0.0099705 0.0171605 -0.1051171 -0.0007242 0.3407616 0.3844105 1.0000000 0.4479367 0.3147371 0.0371596 0.2556384 0.0679149 0.1051698 0.0707269 0.0383771 0.0362529 -0.0300702 -0.0086832 0.0647245 0.0342374 -0.0350247
Score_MORT_30_PN Score_MORT_30_PN -0.0985660 -0.0702915 -0.1904351 -0.1523779 -0.0761104 -0.1858291 -0.0536184 -0.0204107 -0.1710235 0.0241066 0.3309425 0.3710744 0.4479367 1.0000000 0.3042563 0.0303815 0.2301195 0.0543554 0.0884315 0.0217880 0.0237048 0.0704445 0.0089560 0.0393676 0.0464407 0.0029691 -0.0062985
Score_MORT_30_STK Score_MORT_30_STK -0.0376746 -0.0760295 -0.0754281 -0.0988833 0.0146397 -0.0905644 -0.0251654 -0.0804707 -0.1238916 0.0211621 0.2222539 0.2038243 0.3147371 0.3042563 1.0000000 0.0687216 0.2380935 0.0878847 0.1014879 0.0674377 0.0622532 0.0725381 0.0474896 0.0513975 0.0492194 0.0625191 -0.0272101
Score_PSI_03 Score_PSI_03 -0.0037334 -0.0499063 0.0292819 0.0943953 0.0451508 0.0583644 -0.0032584 -0.0287158 -0.0378500 0.0498557 0.0415523 -0.0069743 0.0371596 0.0303815 0.0687216 1.0000000 0.1353085 0.0601750 0.0636661 0.1407342 0.0386211 0.0114365 0.1186788 0.0298580 0.0596798 0.0999683 0.0086745
Score_PSI_04 Score_PSI_04 -0.0449077 -0.0948401 -0.0438108 0.0417007 0.0601831 0.0638412 0.0312780 -0.0995591 -0.1180160 0.0038509 0.2105379 0.1713379 0.2556384 0.2301195 0.2380935 0.1353085 1.0000000 0.0601419 0.0870693 0.1059485 0.0523892 0.0649032 0.0782559 0.0123489 0.0652098 0.1018205 -0.0766302
Score_PSI_06 Score_PSI_06 0.0154891 -0.0054742 -0.0308989 0.0225916 0.0544576 0.0187794 0.0059688 -0.0603629 -0.0262166 0.0505415 0.0885083 0.0478268 0.0679149 0.0543554 0.0878847 0.0601750 0.0601419 1.0000000 0.0724291 0.1014588 0.0516246 0.0351464 0.1431056 0.0509831 0.0527115 0.0910520 0.0456525
Score_PSI_08 Score_PSI_08 -0.0214412 -0.0590998 -0.1637017 -0.0272232 -0.0311579 -0.0806516 -0.0199006 -0.0017558 -0.0522876 0.0577776 0.1010348 0.0397571 0.1051698 0.0884315 0.1014879 0.0636661 0.0870693 0.0724291 1.0000000 0.0052449 -0.0360093 0.0198090 0.0394605 0.0093444 0.0228045 0.0127268 -0.0041983
Score_PSI_09 Score_PSI_09 -0.0182303 0.0707774 -0.0181523 0.0549990 0.0899361 0.0224833 0.0150699 -0.0043396 -0.0577213 0.0540124 0.0889343 0.0429090 0.0707269 0.0217880 0.0674377 0.1407342 0.1059485 0.1014588 0.0052449 1.0000000 0.0885278 0.0680540 0.1732337 0.0519119 0.1207438 0.2197254 -0.0237660
Score_PSI_10 Score_PSI_10 0.0710046 -0.0157615 0.0837980 0.0009222 0.0412713 0.0544528 0.0331569 -0.0382369 -0.0037803 0.0813038 0.1066619 0.0320669 0.0383771 0.0237048 0.0622532 0.0386211 0.0523892 0.0516246 -0.0360093 0.0885278 1.0000000 0.1626632 0.1079488 0.2303938 0.0453739 0.0830134 0.0497447
Score_PSI_11 Score_PSI_11 0.1130121 -0.1615833 0.0455190 -0.0909811 -0.0625676 -0.0076797 -0.0837910 0.0137283 -0.0301775 0.1279724 0.1037006 0.0426574 0.0362529 0.0704445 0.0725381 0.0114365 0.0649032 0.0351464 0.0198090 0.0680540 0.1626632 1.0000000 0.1172504 0.2506376 -0.0093577 0.0464067 0.1441986
Score_PSI_12 Score_PSI_12 0.1047402 -0.0674588 0.0749218 0.1091949 0.0594933 0.1653812 -0.0108546 -0.0380145 -0.0317534 0.1458258 0.0492328 -0.0532586 -0.0300702 0.0089560 0.0474896 0.1186788 0.0782559 0.1431056 0.0394605 0.1732337 0.1079488 0.1172504 1.0000000 0.1742084 0.0522204 0.1358951 0.0655557
Score_PSI_13 Score_PSI_13 0.1193336 -0.1458284 0.0724936 -0.0160408 -0.0250714 0.0993570 -0.0087678 -0.0258097 -0.0440578 0.1334619 0.0467554 0.0026944 -0.0086832 0.0393676 0.0513975 0.0298580 0.0123489 0.0509831 0.0093444 0.0519119 0.2303938 0.2506376 0.1742084 1.0000000 0.0056987 0.0878105 0.0949467
Score_PSI_14 Score_PSI_14 0.0140012 -0.0172231 0.0399935 0.0169512 0.0723872 0.0676972 0.0163779 -0.0166146 0.0086141 0.0498603 0.0454462 0.0734846 0.0647245 0.0464407 0.0492194 0.0596798 0.0652098 0.0527115 0.0228045 0.1207438 0.0453739 -0.0093577 0.0522204 0.0056987 1.0000000 0.1176726 -0.0181150
Score_PSI_15 Score_PSI_15 -0.0158282 0.0304634 0.0144400 0.0430812 0.0625753 0.0374530 0.0419068 -0.0268805 0.0326658 0.0433809 0.0297688 0.0340007 0.0342374 0.0029691 0.0625191 0.0999683 0.1018205 0.0910520 0.0127268 0.2197254 0.0830134 0.0464067 0.1358951 0.0878105 0.1176726 1.0000000 -0.0467071
Payment_PAYM_90_HIP_KNEE Payment_PAYM_90_HIP_KNEE 0.2975679 -0.2108956 0.0053946 -0.0627505 -0.0720431 -0.0241965 -0.0815209 -0.0048449 -0.1256363 0.3410864 0.0591548 -0.0406696 -0.0350247 -0.0062985 -0.0272101 0.0086745 -0.0766302 0.0456525 -0.0041983 -0.0237660 0.0497447 0.1441986 0.0655557 0.0949467 -0.0181150 -0.0467071 1.0000000

Imputation and Handling of Missing Values

# Change - to _ in HIP-KNEE
colnames(HipKneeTest) <- gsub("-", "_", colnames(HipKneeTest))

# Remove all NA values in target variable "PredictedReadmissionRate_HIP_KNEE"
HipKneeTest <- HipKneeTest %>% filter(!is.na(PredictedReadmissionRate_HIP_KNEE))

# Remove all NA values in the "State", "StateCode", and "FacilityName" columns
HipKneeTest <- HipKneeTest %>% drop_na(State, StateCode, FacilityName)


# Print number of remaining variables and observations
dimensions <- dim(HipKneeTest)
cat("Number of variables:", dimensions[2], "\n")
## Number of variables: 31
cat("Number of observations:", dimensions[1], "\n")
## Number of observations: 1833
# Calculate missing values
missing_values_summary <- HipKneeTest %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  mutate(Missing_Percentage = (Missing_Count / nrow(HipKneeTest)) * 100)

# Print table
missing_values_summary %>%
  kable(caption = "Table 7. Missing Values Summary") %>%
  kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
Table 7. Missing Values Summary
Variable Missing_Count Missing_Percentage
FacilityId 0 0.0000000
PredictedReadmissionRate_HIP_KNEE 0 0.0000000
HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE 33 1.8003273
EDV 90 4.9099836
HCP_COVID_19 16 0.8728860
IMM_3 16 0.8728860
OP_18b 75 4.0916530
OP_29 222 12.1112930
SAFE_USE_OF_OPIOIDS 69 3.7643208
VTE_1 994 54.2280415
Score_COMP_HIP_KNEE 40 2.1822149
Score_MORT_30_AMI 405 22.0949264
Score_MORT_30_COPD 247 13.4751773
Score_MORT_30_HF 141 7.6923077
Score_MORT_30_PN 125 6.8194217
Score_MORT_30_STK 284 15.4937261
Score_PSI_03 8 0.4364430
Score_PSI_04 575 31.3693399
Score_PSI_06 2 0.1091107
Score_PSI_08 2 0.1091107
Score_PSI_09 2 0.1091107
Score_PSI_10 41 2.2367703
Score_PSI_11 40 2.1822149
Score_PSI_12 2 0.1091107
Score_PSI_13 42 2.2913257
Score_PSI_14 87 4.7463175
Score_PSI_15 29 1.5821058
FacilityName 0 0.0000000
State 0 0.0000000
Payment_PAYM_90_HIP_KNEE 42 2.2913257
StateCode 0 0.0000000

Impute variables with low percentage missingness (<5%) by the median for numeric variables and mode for categorical variables

# Calculate median for columns with <5% missing values
numeric_vars_low_missing <- c("HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE", "EDV", "HCP_COVID_19", "IMM_3", "OP_18b", "SAFE_USE_OF_OPIOIDS", "Score_COMP_HIP_KNEE", "Score_PSI_03", "Score_PSI_06", "Score_PSI_08", "Score_PSI_09", "Score_PSI_10", "Score_PSI_11", "Score_PSI_12", "Score_PSI_13", "Score_PSI_14", "Score_PSI_15", "Payment_PAYM_90_HIP_KNEE")

for (var in numeric_vars_low_missing) {
  HipKneeTest[[var]][is.na(HipKneeTest[[var]])] <- median(HipKneeTest[[var]], na.rm = TRUE)
}

Impute high percentage missingness variables (>5%) using KNN

# Select high missingness variables for KNN imputation
vars_for_knn <- c("VTE_1", "Score_MORT_30_AMI", "Score_MORT_30_COPD", "Score_MORT_30_HF", "Score_MORT_30_PN", "Score_MORT_30_STK", "Score_PSI_04", "OP_29")

# Perform KNN imputation
HipKneeTest_knn <- kNN(HipKneeTest, variable = vars_for_knn, k = 5)

# Remove columns created by the KNN function
HipKneeTest_knn <- HipKneeTest_knn %>% select(-ends_with("_imp"))

# Update HipKneeTrain with imputed values
HipKneeTest[vars_for_knn] <- HipKneeTest_knn[vars_for_knn]
# Calculate missing values
missing_values_summary <- HipKneeTest %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  mutate(Missing_Percentage = (Missing_Count / nrow(HipKneeTest)) * 100)

# Print table
missing_values_summary %>%
  kable(caption = "Table 7. Missing Values Summary") %>%
  kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
Table 7. Missing Values Summary
Variable Missing_Count Missing_Percentage
FacilityId 0 0
PredictedReadmissionRate_HIP_KNEE 0 0
HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE 0 0
EDV 0 0
HCP_COVID_19 0 0
IMM_3 0 0
OP_18b 0 0
OP_29 0 0
SAFE_USE_OF_OPIOIDS 0 0
VTE_1 0 0
Score_COMP_HIP_KNEE 0 0
Score_MORT_30_AMI 0 0
Score_MORT_30_COPD 0 0
Score_MORT_30_HF 0 0
Score_MORT_30_PN 0 0
Score_MORT_30_STK 0 0
Score_PSI_03 0 0
Score_PSI_04 0 0
Score_PSI_06 0 0
Score_PSI_08 0 0
Score_PSI_09 0 0
Score_PSI_10 0 0
Score_PSI_11 0 0
Score_PSI_12 0 0
Score_PSI_13 0 0
Score_PSI_14 0 0
Score_PSI_15 0 0
FacilityName 0 0
State 0 0
Payment_PAYM_90_HIP_KNEE 0 0
StateCode 0 0

Feature Engineer Mortality Data

# Average death rates amongst mortality variables and create new column "Score_Ovr_MORT"
HipKneeTest$Score_Ovr_MORT <- rowMeans(HipKneeTest[, c("Score_MORT_30_AMI", 
                                                         "Score_MORT_30_COPD", 
                                                         "Score_MORT_30_HF", 
                                                         "Score_MORT_30_PN", 
                                                         "Score_MORT_30_STK")], 
                                                          na.rm = TRUE)

# Remove old mortality columns
HipKneeTest <- HipKneeTest[, !(names(HipKneeTest) %in% c("Score_MORT_30_AMI", 
                                                            "Score_MORT_30_COPD",
                                                            "Score_MORT_30_HF", 
                                                            "Score_MORT_30_PN", 
                                                            "Score_MORT_30_STK"))]

Reassess heatmap with engineered mortality data

# Compute correlation matrix
cor_matrix <- cor(HipKneeTest %>% select_if(is.numeric), use = "pairwise.complete.obs")

# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)

# Plot heatmap
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Figure 5. Correlation Heatmap of Numeric Variables")

Save the data for future ease of use

save(HipKneeTest, file = "HipKneeTest.RData")

Descriptive Statistics

# Create a summary table of descriptive statistics
descr_stats <- describe(HipKneeTrain)
# Remove the rows with Facility ID, State and State code, and facility name
descr_stats <- descr_stats %>% filter(vars != c(1, 23, 24, 26))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `vars != c(1, 23, 24, 26)`.
## Caused by warning in `vars != c(1, 23, 24, 26)`:
## ! longer object length is not a multiple of shorter object length
# Remove columns 1, 2, 5, and 6
descr_stats <- descr_stats[, -c(1, 2, 5, 6)]

# Create a table with kable
kable(descr_stats, format = "html", caption = "Descriptive Statistics for All Numeric Variables in Final Dataset") %>%
  kable_styling(
    bootstrap_options = c("hover", "striped", "responsive")
  ) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, width = "5em") %>%
  row_spec(0, bold = TRUE, background = "#f2f2f2")
Descriptive Statistics for All Numeric Variables in Final Dataset
mean sd mad min max range skew kurtosis se
PredictedReadmissionRate_HIP_KNEE 4.546561e+00 0.9093914 0.8576841 1.9279 8.569 6.6411 0.4373061 0.4908058 0.0212407
HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE 8.672450e+01 3.7230744 2.9652000 66.0000 98.000 32.0000 -0.5851721 1.5050011 0.0869602
EDV 2.694490e+00 1.0250451 1.4826000 1.0000 4.000 3.0000 -0.1385316 -1.1578643 0.0239421
HCP_COVID_19 8.887098e+01 9.5606361 8.3025600 32.6000 100.000 67.4000 -1.3148000 2.1219028 0.2233087
IMM_3 7.932570e+01 17.7478481 16.3086000 4.0000 100.000 96.0000 -1.0958217 0.7241943 0.4145381
OP_18b 1.895739e+02 49.6783664 44.4780000 62.0000 587.000 525.0000 0.9817063 3.3080181 1.1603422
OP_29 9.215876e+01 11.7440127 4.4478000 0.0000 100.000 100.0000 -3.3057142 14.4641144 0.2743060
SAFE_USE_OF_OPIOIDS 1.557229e+01 4.2502679 2.9652000 0.0000 45.000 45.0000 0.5890847 3.1474800 0.0992739
VTE_1 9.107365e+01 9.1561257 4.4478000 5.0000 100.000 95.0000 -4.3773341 29.2741519 0.2138605
Score_COMP_HIP_KNEE 3.179924e+00 0.5539759 0.4447800 1.6000 6.200 4.6000 0.7280978 1.7214783 0.0129393
Score_PSI_03 5.878123e-01 0.5183737 0.2816940 0.0500 6.310 6.2600 3.5027779 21.8351982 0.0121077
Score_PSI_04 1.688142e+02 19.0239898 15.6117780 86.6800 241.810 155.1300 -0.0656641 1.2542628 0.4443451
Score_PSI_06 2.467703e-01 0.0452557 0.0296520 0.1200 0.510 0.3900 0.9085914 1.9551905 0.0010570
Score_PSI_08 8.994540e-02 0.0082004 0.0000000 0.0600 0.130 0.0700 0.5502254 1.7580966 0.0001915
Score_PSI_09 2.514146e+00 0.4953103 0.3409980 1.1000 6.100 5.0000 1.2648156 4.8211377 0.0115690
Score_PSI_10 1.572951e+00 0.3731798 0.1482600 0.4700 4.550 4.0800 1.8287437 7.4056616 0.0087164
Score_PSI_11 8.969864e+00 3.1123203 2.2239000 2.7300 49.000 46.2700 2.7479837 22.3393252 0.0726947
Score_PSI_12 3.582755e+00 0.7944605 0.6671700 1.6100 7.510 5.9000 0.9512312 1.6446910 0.0185563
Score_PSI_13 5.287103e+00 1.0312666 0.7413000 2.1700 10.790 8.6200 1.0438233 2.7498825 0.0240874
Score_PSI_14 2.010540e+00 0.3569335 0.1927380 1.0700 4.400 3.3300 1.9129489 6.4857308 0.0083369
Score_PSI_15 1.102668e+00 0.3271882 0.2223900 0.3500 3.430 3.0800 1.7127063 5.4101230 0.0076422
FacilityName* 9.026263e+02 517.8533090 667.1700000 1.0000 1796.000 1795.0000 -0.0093756 -1.2029068 12.0955473
State* 2.469449e+01 14.4448000 20.7564000 1.0000 50.000 49.0000 0.0222112 -1.3657016 0.3373885
Payment_PAYM_90_HIP_KNEE 2.105666e+04 1943.7230953 1716.8508000 15936.0000 34916.000 18980.0000 0.7645293 1.9173982 45.3997188
StateCode* 2.472995e+01 14.5368326 20.7564000 1.0000 50.000 49.0000 0.0160810 -1.3497989 0.3395381
Score_Ovr_MORT 1.305795e+01 1.1615024 1.0378200 8.1600 17.420 9.2600 -0.0667113 0.5511229 0.0271293
# Select numeric columns
numeric_columns <- HipKneeTrain %>% select_if(is.numeric)

# Melt the data for easier plotting with ggplot2
numeric_melted <- melt(numeric_columns)
## No id variables; using all as measure variables
# Create histograms
ggplot(numeric_melted, aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  facet_wrap(~variable, scales = "free_x") +
  theme_minimal() +
  labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")

Appendix 3: Segmentation Analysis Code

k-means Clustering (SE)

# Select numeric columns for clustering
numeric_columns <- HipKneeTrain %>% select_if(is.numeric)

# Standardize features
X_scaled <- scale(numeric_columns)

# Determine optimal number of clusters using elbow plot
set.seed(123)
elbow_plot <- fviz_nbclust(X_scaled, kmeans, method = "wss", k.max = 10) +
  labs(title = "Elbow Plot for Optimal k")

print(elbow_plot)

# Optimal K = 3
optimal_k <- 3
kmeans_result <- kmeans(X_scaled, centers = optimal_k, nstart = 25)

# Create a new df for K-Means Clustering results
HipKneeTrain_K_Means <- HipKneeTrain %>%
  mutate(Cluster = as.factor(kmeans_result$cluster))

# Visualize clusters
fviz_cluster(kmeans_result, data = X_scaled,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())

# Cluster characteristics
cluster_summary <- HipKneeTrain_K_Means %>%
  group_by(Cluster) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)

print(cluster_summary)
## # A tibble: 3 × 24
##   Cluster PredictedReadmission…¹ HcahpsLinearMeanValu…²   EDV HCP_COVID_19 IMM_3
##   <fct>                    <dbl>                  <dbl> <dbl>        <dbl> <dbl>
## 1 1                         4.73                   85.0  2.51         80.6  63.3
## 2 2                         4.19                   88.2  2.53         91.8  85.8
## 3 3                         4.98                   85.9  3.19         92.4  84.7
## # ℹ abbreviated names: ¹​PredictedReadmissionRate_HIP_KNEE,
## #   ²​HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE
## # ℹ 18 more variables: OP_18b <dbl>, OP_29 <dbl>, SAFE_USE_OF_OPIOIDS <dbl>,
## #   VTE_1 <dbl>, Score_COMP_HIP_KNEE <dbl>, Score_PSI_03 <dbl>,
## #   Score_PSI_04 <dbl>, Score_PSI_06 <dbl>, Score_PSI_08 <dbl>,
## #   Score_PSI_09 <dbl>, Score_PSI_10 <dbl>, Score_PSI_11 <dbl>,
## #   Score_PSI_12 <dbl>, Score_PSI_13 <dbl>, Score_PSI_14 <dbl>, …
# Visualize feature distributions across clusters
features_to_plot <- c("PredictedReadmissionRate_HIP_KNEE", "HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE", "Score_COMP_HIP_KNEE", "SAFE_USE_OF_OPIOIDS")

for (feature in features_to_plot) {
  p <- ggplot(HipKneeTrain_K_Means, aes(x = Cluster, y = .data[[feature]], fill = Cluster)) +
    geom_boxplot() +
    theme_minimal() +
    labs(title = paste("Distribution of", feature, "across clusters"))
  print(p)
}

Hierarchical Clustering (SE)

# Perform PCA
pca_result <- prcomp(X_scaled, center = TRUE, scale. = TRUE)

# Visualize variance
fviz_eig(pca_result, addlabels = TRUE)

# Factor map
fviz_pca_var(pca_result, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)

# PC scores, three components
pca_scores <- as.data.frame(pca_result$x[, 1:3])

# Store PCA results in a new dataframe
HipKneeTrain_PCA <- HipKneeTrain %>%
  select_if(is.numeric) %>%
  bind_cols(pca_scores)

# Hierarchical Clustering

# Compute distance matrix
dist_matrix <- dist(X_scaled, method = "euclidean")

# Perform hierarchical clustering
hc_result <- hclust(dist_matrix, method = "ward.D2")

# Compute WCSS for different number of clusters
wcss <- sapply(1:10, function(k) {
  clusters <- cutree(hc_result, k)
  cluster_data <- scale(X_scaled)
  tot.withinss <- sum(sapply(unique(clusters), function(c) {
    sum(dist(cluster_data[clusters == c, , drop = FALSE])^2)
  }))
  return(tot.withinss)
})

# Plot WCSS
plot(1:10, wcss, type = "b", xlab = "Number of Clusters", ylab = "WCSS")

# Create clusters with optimal number of clusters from WCSS plot
k <- 3
hc_clusters <- cutree(hc_result, k = k)

# Store hierarchical clustering results in a new dataframe
HipKneeTrain_HC <- HipKneeTrain %>%
  mutate(HC_Cluster = as.factor(hc_clusters))

# Visualize clusters using first three PCs
pca_plot_data <- cbind(pca_scores[, 1:3], Cluster = hc_clusters)
fviz_cluster(list(data = pca_plot_data, cluster = hc_clusters),
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal(),
             main = "Hierarchical Clustering Results (PCA)")

# Analyze cluster characteristics
hc_cluster_summary <- HipKneeTrain_HC %>%
  group_by(HC_Cluster) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)

print(hc_cluster_summary)
## # A tibble: 3 × 24
##   HC_Cluster PredictedReadmissionRat…¹ HcahpsLinearMeanValu…²   EDV HCP_COVID_19
##   <fct>                          <dbl>                  <dbl> <dbl>        <dbl>
## 1 1                               4.82                   85.3  2.51         85.1
## 2 2                               4.24                   88.0  2.62         91.0
## 3 3                               4.72                   86.5  3.22         91.4
## # ℹ abbreviated names: ¹​PredictedReadmissionRate_HIP_KNEE,
## #   ²​HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE
## # ℹ 19 more variables: IMM_3 <dbl>, OP_18b <dbl>, OP_29 <dbl>,
## #   SAFE_USE_OF_OPIOIDS <dbl>, VTE_1 <dbl>, Score_COMP_HIP_KNEE <dbl>,
## #   Score_PSI_03 <dbl>, Score_PSI_04 <dbl>, Score_PSI_06 <dbl>,
## #   Score_PSI_08 <dbl>, Score_PSI_09 <dbl>, Score_PSI_10 <dbl>,
## #   Score_PSI_11 <dbl>, Score_PSI_12 <dbl>, Score_PSI_13 <dbl>, …
# Visualize feature distributions across clusters
features_to_plot <- c("PredictedReadmissionRate_HIP_KNEE", "HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE", "Score_COMP_HIP_KNEE", "SAFE_USE_OF_OPIOIDS")

for (feature in features_to_plot) {
  p <- ggplot(HipKneeTrain_HC, aes_string(x = "HC_Cluster", y = feature, fill = "HC_Cluster")) +
    geom_boxplot() +
    theme_minimal() +
    labs(title = paste("Distribution of", feature, "across Hierarchical Clusters"))
  print(p)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

> This is a preliminary segmentation analysis and we would go forth and tighten/tidy this up some more. However, just initial impression, I’m not sure clustering is entirely beneficial with our dataset. What are your thoughts on our preliminary segmentation analysis? Any ideas in which we could improve our clustering to perhaps be more meaningful?

Appendix 4: Initial Model Code

Random Forest

Creating the model

# Remove unwanted columns from the dataset
HipKneeTrain_RF <- HipKneeTrain %>%
  select(-State, -FacilityName, -FacilityId)

# Define mtry parameter grid
grid <- expand.grid(
  mtry = c(2, 4, 6, 8)
)

# Define CV
train_control <- trainControl(
  method = "cv",         
  number = 7,           
  verboseIter = TRUE     
)

# Set seed 
set.seed(123)

# Train the Random Forest model with grid search
rf_grid_search <- train(
  PredictedReadmissionRate_HIP_KNEE ~ .,   
  data = HipKneeTrain_RF,                    
  method = "rf",                          
  trControl = train_control,              
  tuneGrid = grid,                       
  importance = TRUE,                     
  ntree = 100                         
)
## + Fold1: mtry=2 
## - Fold1: mtry=2 
## + Fold1: mtry=4 
## - Fold1: mtry=4 
## + Fold1: mtry=6 
## - Fold1: mtry=6 
## + Fold1: mtry=8 
## - Fold1: mtry=8 
## + Fold2: mtry=2 
## - Fold2: mtry=2 
## + Fold2: mtry=4 
## - Fold2: mtry=4 
## + Fold2: mtry=6 
## - Fold2: mtry=6 
## + Fold2: mtry=8 
## - Fold2: mtry=8 
## + Fold3: mtry=2 
## - Fold3: mtry=2 
## + Fold3: mtry=4 
## - Fold3: mtry=4 
## + Fold3: mtry=6 
## - Fold3: mtry=6 
## + Fold3: mtry=8 
## - Fold3: mtry=8 
## + Fold4: mtry=2 
## - Fold4: mtry=2 
## + Fold4: mtry=4 
## - Fold4: mtry=4 
## + Fold4: mtry=6 
## - Fold4: mtry=6 
## + Fold4: mtry=8 
## - Fold4: mtry=8 
## + Fold5: mtry=2 
## - Fold5: mtry=2 
## + Fold5: mtry=4 
## - Fold5: mtry=4 
## + Fold5: mtry=6 
## - Fold5: mtry=6 
## + Fold5: mtry=8 
## - Fold5: mtry=8 
## + Fold6: mtry=2 
## - Fold6: mtry=2 
## + Fold6: mtry=4 
## - Fold6: mtry=4 
## + Fold6: mtry=6 
## - Fold6: mtry=6 
## + Fold6: mtry=8 
## - Fold6: mtry=8 
## + Fold7: mtry=2 
## - Fold7: mtry=2 
## + Fold7: mtry=4 
## - Fold7: mtry=4 
## + Fold7: mtry=6 
## - Fold7: mtry=6 
## + Fold7: mtry=8 
## - Fold7: mtry=8 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 8 on full training set
# Best parameters
best_params <- rf_grid_search$bestTune
print(best_params)
##   mtry
## 4    8
# Extract feature importances
best_rf_model <- rf_grid_search$finalModel
feature_importances <- importance(best_rf_model)

# Convert feature importances to a df
feature_importances_df <- as.data.frame(feature_importances)
feature_importances_df$Feature <- rownames(feature_importances_df)

# Sort importances by %IncMSE
sorted_by_inc_mse <- feature_importances_df %>%
  arrange(desc(`%IncMSE`))

# Sort importances by IncNodePurity
sorted_by_inc_node_purity <- feature_importances_df %>%
  arrange(desc(IncNodePurity))

# Print importances
cat("Feature Importances by %IncMSE:\n")
## Feature Importances by %IncMSE:
print(sorted_by_inc_mse)
##                                                     %IncMSE IncNodePurity
## Score_COMP_HIP_KNEE                              9.73571002   114.4625628
## Payment_PAYM_90_HIP_KNEE                         9.62291429   123.8902502
## HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE  7.55562148    72.2391717
## OP_18b                                           5.46985104    68.5356171
## Score_Ovr_MORT                                   5.40776904    60.2425197
## EDV                                              5.17769225    29.8450844
## Score_PSI_13                                     4.90003360    58.6569716
## Score_PSI_14                                     4.88568699    59.2358357
## StateCode009                                     4.83285052    12.2425241
## Score_PSI_15                                     4.29416986    60.6220998
## Score_PSI_10                                     4.01746219    59.5208962
## StateCode022                                     3.78888482     8.4435485
## StateCode017                                     3.75496873     6.4580507
## Score_PSI_03                                     3.57263120    59.4229012
## Score_PSI_04                                     2.89137476    51.8655680
## StateCode025                                     2.75051445     3.9203910
## Score_PSI_06                                     2.65706807    41.7370371
## Score_PSI_12                                     2.49480062    56.8858338
## StateCode032                                     2.38022955     3.5868612
## SAFE_USE_OF_OPIOIDS                              2.37149080    46.2049432
## HCP_COVID_19                                     2.29805126    52.0890212
## VTE_1                                            2.28747454    46.0078517
## StateCode004                                     2.24967556     1.7659648
## StateCode024                                     2.04471716     1.8304484
## Score_PSI_11                                     2.00699983    59.1724094
## StateCode046                                     2.00522157     1.9184269
## StateCode043                                     1.92682199     5.8039853
## StateCode042                                     1.86872359     1.3312563
## StateCode007                                     1.83144333     1.4674657
## StateCode012                                     1.77796848     0.6481818
## OP_29                                            1.68237455    36.6180186
## StateCode005                                     1.67646518     4.4827144
## StateCode045                                     1.55364669     0.2510802
## Score_PSI_09                                     1.49299109    53.4640474
## StateCode041                                     1.41485782     1.4299112
## StateCode031                                     1.28541908     0.2394702
## StateCode019                                     1.21832830     0.7508580
## Score_PSI_08                                     1.19950486    17.6795267
## StateCode006                                     0.91321821     1.6759295
## StateCode034                                     0.88385164     0.2911008
## StateCode035                                     0.80770198     6.1570394
## StateCode016                                     0.70775747     2.2735052
## StateCode026                                     0.65105418     0.5724423
## StateCode050                                     0.52205617     0.3265311
## StateCode023                                     0.52181168     3.2959928
## StateCode014                                     0.51932923     5.9144268
## StateCode028                                     0.34998172     0.7556607
## StateCode003                                     0.27166644     1.0239009
## StateCode040                                     0.27083318     2.4285788
## StateCode015                                     0.25740577     1.0803806
## StateCode030                                     0.14704729     5.1846769
## StateCode044                                     0.10606545     1.6302575
## StateCode010                                     0.09240851     2.5641148
## StateCode008                                     0.00000000     0.0679025
## IMM_3                                           -0.02272136    42.2476506
## StateCode038                                    -0.02739101     3.0902474
## StateCode002                                    -0.07340925     0.8448698
## StateCode027                                    -0.17352773     2.3772938
## StateCode013                                    -0.32461931     4.7749396
## StateCode036                                    -0.71086164     2.2682670
## StateCode039                                    -0.72498337     0.1262152
## StateCode033                                    -0.86699797     2.3972385
## StateCode020                                    -0.92017993     1.7213020
## StateCode018                                    -1.07738001     1.3573963
## StateCode048                                    -1.17500560     0.5437489
## StateCode011                                    -1.21828161     0.6886992
## StateCode047                                    -1.28603857     1.9595272
## StateCode037                                    -1.45654291     0.6948348
## StateCode029                                    -1.91255388     0.5252000
## StateCode049                                    -2.13959539     2.5998141
## StateCode021                                    -2.14531241     1.6954613
##                                                                                         Feature
## Score_COMP_HIP_KNEE                                                         Score_COMP_HIP_KNEE
## Payment_PAYM_90_HIP_KNEE                                               Payment_PAYM_90_HIP_KNEE
## HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE
## OP_18b                                                                                   OP_18b
## Score_Ovr_MORT                                                                   Score_Ovr_MORT
## EDV                                                                                         EDV
## Score_PSI_13                                                                       Score_PSI_13
## Score_PSI_14                                                                       Score_PSI_14
## StateCode009                                                                       StateCode009
## Score_PSI_15                                                                       Score_PSI_15
## Score_PSI_10                                                                       Score_PSI_10
## StateCode022                                                                       StateCode022
## StateCode017                                                                       StateCode017
## Score_PSI_03                                                                       Score_PSI_03
## Score_PSI_04                                                                       Score_PSI_04
## StateCode025                                                                       StateCode025
## Score_PSI_06                                                                       Score_PSI_06
## Score_PSI_12                                                                       Score_PSI_12
## StateCode032                                                                       StateCode032
## SAFE_USE_OF_OPIOIDS                                                         SAFE_USE_OF_OPIOIDS
## HCP_COVID_19                                                                       HCP_COVID_19
## VTE_1                                                                                     VTE_1
## StateCode004                                                                       StateCode004
## StateCode024                                                                       StateCode024
## Score_PSI_11                                                                       Score_PSI_11
## StateCode046                                                                       StateCode046
## StateCode043                                                                       StateCode043
## StateCode042                                                                       StateCode042
## StateCode007                                                                       StateCode007
## StateCode012                                                                       StateCode012
## OP_29                                                                                     OP_29
## StateCode005                                                                       StateCode005
## StateCode045                                                                       StateCode045
## Score_PSI_09                                                                       Score_PSI_09
## StateCode041                                                                       StateCode041
## StateCode031                                                                       StateCode031
## StateCode019                                                                       StateCode019
## Score_PSI_08                                                                       Score_PSI_08
## StateCode006                                                                       StateCode006
## StateCode034                                                                       StateCode034
## StateCode035                                                                       StateCode035
## StateCode016                                                                       StateCode016
## StateCode026                                                                       StateCode026
## StateCode050                                                                       StateCode050
## StateCode023                                                                       StateCode023
## StateCode014                                                                       StateCode014
## StateCode028                                                                       StateCode028
## StateCode003                                                                       StateCode003
## StateCode040                                                                       StateCode040
## StateCode015                                                                       StateCode015
## StateCode030                                                                       StateCode030
## StateCode044                                                                       StateCode044
## StateCode010                                                                       StateCode010
## StateCode008                                                                       StateCode008
## IMM_3                                                                                     IMM_3
## StateCode038                                                                       StateCode038
## StateCode002                                                                       StateCode002
## StateCode027                                                                       StateCode027
## StateCode013                                                                       StateCode013
## StateCode036                                                                       StateCode036
## StateCode039                                                                       StateCode039
## StateCode033                                                                       StateCode033
## StateCode020                                                                       StateCode020
## StateCode018                                                                       StateCode018
## StateCode048                                                                       StateCode048
## StateCode011                                                                       StateCode011
## StateCode047                                                                       StateCode047
## StateCode037                                                                       StateCode037
## StateCode029                                                                       StateCode029
## StateCode049                                                                       StateCode049
## StateCode021                                                                       StateCode021
cat("\nFeature Importances by IncNodePurity:\n")
## 
## Feature Importances by IncNodePurity:
print(sorted_by_inc_node_purity)
##                                                     %IncMSE IncNodePurity
## Payment_PAYM_90_HIP_KNEE                         9.62291429   123.8902502
## Score_COMP_HIP_KNEE                              9.73571002   114.4625628
## HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE  7.55562148    72.2391717
## OP_18b                                           5.46985104    68.5356171
## Score_PSI_15                                     4.29416986    60.6220998
## Score_Ovr_MORT                                   5.40776904    60.2425197
## Score_PSI_10                                     4.01746219    59.5208962
## Score_PSI_03                                     3.57263120    59.4229012
## Score_PSI_14                                     4.88568699    59.2358357
## Score_PSI_11                                     2.00699983    59.1724094
## Score_PSI_13                                     4.90003360    58.6569716
## Score_PSI_12                                     2.49480062    56.8858338
## Score_PSI_09                                     1.49299109    53.4640474
## HCP_COVID_19                                     2.29805126    52.0890212
## Score_PSI_04                                     2.89137476    51.8655680
## SAFE_USE_OF_OPIOIDS                              2.37149080    46.2049432
## VTE_1                                            2.28747454    46.0078517
## IMM_3                                           -0.02272136    42.2476506
## Score_PSI_06                                     2.65706807    41.7370371
## OP_29                                            1.68237455    36.6180186
## EDV                                              5.17769225    29.8450844
## Score_PSI_08                                     1.19950486    17.6795267
## StateCode009                                     4.83285052    12.2425241
## StateCode022                                     3.78888482     8.4435485
## StateCode017                                     3.75496873     6.4580507
## StateCode035                                     0.80770198     6.1570394
## StateCode014                                     0.51932923     5.9144268
## StateCode043                                     1.92682199     5.8039853
## StateCode030                                     0.14704729     5.1846769
## StateCode013                                    -0.32461931     4.7749396
## StateCode005                                     1.67646518     4.4827144
## StateCode025                                     2.75051445     3.9203910
## StateCode032                                     2.38022955     3.5868612
## StateCode023                                     0.52181168     3.2959928
## StateCode038                                    -0.02739101     3.0902474
## StateCode049                                    -2.13959539     2.5998141
## StateCode010                                     0.09240851     2.5641148
## StateCode040                                     0.27083318     2.4285788
## StateCode033                                    -0.86699797     2.3972385
## StateCode027                                    -0.17352773     2.3772938
## StateCode016                                     0.70775747     2.2735052
## StateCode036                                    -0.71086164     2.2682670
## StateCode047                                    -1.28603857     1.9595272
## StateCode046                                     2.00522157     1.9184269
## StateCode024                                     2.04471716     1.8304484
## StateCode004                                     2.24967556     1.7659648
## StateCode020                                    -0.92017993     1.7213020
## StateCode021                                    -2.14531241     1.6954613
## StateCode006                                     0.91321821     1.6759295
## StateCode044                                     0.10606545     1.6302575
## StateCode007                                     1.83144333     1.4674657
## StateCode041                                     1.41485782     1.4299112
## StateCode018                                    -1.07738001     1.3573963
## StateCode042                                     1.86872359     1.3312563
## StateCode015                                     0.25740577     1.0803806
## StateCode003                                     0.27166644     1.0239009
## StateCode002                                    -0.07340925     0.8448698
## StateCode028                                     0.34998172     0.7556607
## StateCode019                                     1.21832830     0.7508580
## StateCode037                                    -1.45654291     0.6948348
## StateCode011                                    -1.21828161     0.6886992
## StateCode012                                     1.77796848     0.6481818
## StateCode026                                     0.65105418     0.5724423
## StateCode048                                    -1.17500560     0.5437489
## StateCode029                                    -1.91255388     0.5252000
## StateCode050                                     0.52205617     0.3265311
## StateCode034                                     0.88385164     0.2911008
## StateCode045                                     1.55364669     0.2510802
## StateCode031                                     1.28541908     0.2394702
## StateCode039                                    -0.72498337     0.1262152
## StateCode008                                     0.00000000     0.0679025
##                                                                                         Feature
## Payment_PAYM_90_HIP_KNEE                                               Payment_PAYM_90_HIP_KNEE
## Score_COMP_HIP_KNEE                                                         Score_COMP_HIP_KNEE
## HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE
## OP_18b                                                                                   OP_18b
## Score_PSI_15                                                                       Score_PSI_15
## Score_Ovr_MORT                                                                   Score_Ovr_MORT
## Score_PSI_10                                                                       Score_PSI_10
## Score_PSI_03                                                                       Score_PSI_03
## Score_PSI_14                                                                       Score_PSI_14
## Score_PSI_11                                                                       Score_PSI_11
## Score_PSI_13                                                                       Score_PSI_13
## Score_PSI_12                                                                       Score_PSI_12
## Score_PSI_09                                                                       Score_PSI_09
## HCP_COVID_19                                                                       HCP_COVID_19
## Score_PSI_04                                                                       Score_PSI_04
## SAFE_USE_OF_OPIOIDS                                                         SAFE_USE_OF_OPIOIDS
## VTE_1                                                                                     VTE_1
## IMM_3                                                                                     IMM_3
## Score_PSI_06                                                                       Score_PSI_06
## OP_29                                                                                     OP_29
## EDV                                                                                         EDV
## Score_PSI_08                                                                       Score_PSI_08
## StateCode009                                                                       StateCode009
## StateCode022                                                                       StateCode022
## StateCode017                                                                       StateCode017
## StateCode035                                                                       StateCode035
## StateCode014                                                                       StateCode014
## StateCode043                                                                       StateCode043
## StateCode030                                                                       StateCode030
## StateCode013                                                                       StateCode013
## StateCode005                                                                       StateCode005
## StateCode025                                                                       StateCode025
## StateCode032                                                                       StateCode032
## StateCode023                                                                       StateCode023
## StateCode038                                                                       StateCode038
## StateCode049                                                                       StateCode049
## StateCode010                                                                       StateCode010
## StateCode040                                                                       StateCode040
## StateCode033                                                                       StateCode033
## StateCode027                                                                       StateCode027
## StateCode016                                                                       StateCode016
## StateCode036                                                                       StateCode036
## StateCode047                                                                       StateCode047
## StateCode046                                                                       StateCode046
## StateCode024                                                                       StateCode024
## StateCode004                                                                       StateCode004
## StateCode020                                                                       StateCode020
## StateCode021                                                                       StateCode021
## StateCode006                                                                       StateCode006
## StateCode044                                                                       StateCode044
## StateCode007                                                                       StateCode007
## StateCode041                                                                       StateCode041
## StateCode018                                                                       StateCode018
## StateCode042                                                                       StateCode042
## StateCode015                                                                       StateCode015
## StateCode003                                                                       StateCode003
## StateCode002                                                                       StateCode002
## StateCode028                                                                       StateCode028
## StateCode019                                                                       StateCode019
## StateCode037                                                                       StateCode037
## StateCode011                                                                       StateCode011
## StateCode012                                                                       StateCode012
## StateCode026                                                                       StateCode026
## StateCode048                                                                       StateCode048
## StateCode029                                                                       StateCode029
## StateCode050                                                                       StateCode050
## StateCode034                                                                       StateCode034
## StateCode045                                                                       StateCode045
## StateCode031                                                                       StateCode031
## StateCode039                                                                       StateCode039
## StateCode008                                                                       StateCode008

Assessing Random Forest Performance

# Remove columns from the test set to match train set
HipKneeTest_RF <- HipKneeTest %>%
  select(-State, -FacilityName, -FacilityId)

# Make predictions on test set
rf_predictions <- predict(rf_grid_search, newdata = HipKneeTest_RF)

# Actual values
actual_values <- HipKneeTest$PredictedReadmissionRate_HIP_KNEE

# Calculate RMSE
mse <- mean((rf_predictions - actual_values)^2)
rmse <- sqrt(mse)

# Calculate R-squared
ss_total <- sum((actual_values - mean(actual_values))^2)
ss_residual <- sum((rf_predictions - actual_values)^2)
r_squared <- 1 - (ss_residual / ss_total)

# Print RMSE and R-squared
cat("RMSE on test set:\n")
## RMSE on test set:
print(rmse)
## [1] 0.358101
cat("\nR-squared on test set:\n")
## 
## R-squared on test set:
print(r_squared)
## [1] 0.8448519

Testing assumptions

# Calculate residuals
residuals_rf <- actual_values - rf_predictions

# Residuals vs Fitted Values plot
ggplot(data = NULL, aes(x = rf_predictions, y = residuals_rf)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Histogram of residuals
ggplot(data = NULL, aes(x = residuals_rf)) +
  geom_histogram(binwidth = 0.1, fill = "blue", alpha = 0.7, boundary = 0) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Frequency") +
  theme_minimal()

# QQ plot of residuals
qqnorm(residuals_rf, main = "QQ Plot of Residuals")
qqline(residuals_rf, col = "red")

# Perform Durbin-Watson test for autocorrelation in residuals
dw_test_result <- dwtest(lm(residuals_rf ~ rf_predictions))
print(dw_test_result)
## 
##  Durbin-Watson test
## 
## data:  lm(residuals_rf ~ rf_predictions)
## DW = 1.7717, p-value = 4.947e-07
## alternative hypothesis: true autocorrelation is greater than 0